In the spring semester of 1997, we taught a course on operating systems based on Linux 2.0. The idea was to encourage students to read the source code. To achieve this, we assigned term projects consisting of making changes to the kernel and performing tests on the modified version. We also wrote course notes for our students about a few critical features of Linux such as task switching and task scheduling.
Out of this work — and with a lot of support from our O'Reilly editor Andy Oram — came the first edition of Understanding the Linux Kernel at the end of 2000, which covered Linux 2.2 with a few anticipations on Linux 2.4. The success encountered by this book encouraged us to continue along this line. At the end of 2002, we came out with a second edition covering Linux 2.4. You are now looking at the third edition, which covers Linux 2.6.
As in our previous experiences, we read thousands of lines of code, trying to make sense of them. After all this work, we can say that it was worth the effort. We learned a lot of things you don't find in books, and we hope we have succeeded in conveying some of this information in the following pages.
All people curious about how Linux works and why it is so efficient will find answers here. After reading the book, you will find your way through the many thousands of lines of code, distinguishing between crucial data structures and secondary ones—in short, becoming a true Linux hacker.
Our work might be considered a guided tour of the Linux kernel: most of the significant data structures and many algorithms and programming tricks used in the kernel are discussed. In many cases, the relevant fragments of code are discussed line by line. Of course, you should have the Linux source code on hand and should be willing to expend some effort deciphering some of the functions that are not, for the sake of brevity, fully described.
On another level, the book provides valuable insight to people who want to know more about the critical design issues in a modern operating system. It is not specifically addressed to system administrators or programmers; it is mostly for people who want to understand how things really work inside the machine! As with any good guide, we try to go beyond superficial features. We offer a background, such as the history of major features and the reasons why they were used.
When we began to write this book, we were faced with a critical decision: should we refer to a specific hardware platform or skip the hardware-dependent details and concentrate on the pure hardware-independent parts of the kernel?
Other books on Linux kernel internals have chosen the latter approach; we decided to adopt the former one for the following reasons:
Efficient kernels take advantage of most available hardware features, such as addressing techniques, caches, processor exceptions, special instructions, processor control registers, and so on. If we want to convince you that the kernel indeed does quite a good job in performing a specific task, we must first tell you what kind of support comes from the hardware.
Even if a large portion of a Unix kernel's source code is processor-independent and coded in the C language, a small and critical part is coded in assembly language. A thorough knowledge of the kernel, therefore, requires the study of a few assembly language fragments that interact with the hardware.
When covering hardware features, our strategy is quite simple: only sketch the features that are totally hardware-driven while detailing those that need some software support. In fact, we are interested in kernel design rather than in computer architecture.
Our next step in choosing our path consisted of selecting the computer system to describe. Although Linux is now running on several kinds of personal computers and workstations, we decided to concentrate on the very popular and cheap IBM-compatible personal computers—and thus on the 80×86 microprocessors and on some support chips included in these personal computers. The term 80×86 microprocessor will be used in the forthcoming chapters to denote the Intel 80386, 80486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 microprocessors or compatible models. In a few cases, explicit references will be made to specific models.
One more choice we had to make was the order to follow in studying Linux components. We tried a bottom-up approach: start with topics that are hardware-dependent and end with those that are totally hardware-independent. In fact, we'll make many references to the 80×86 microprocessors in the first part of the book, while the rest of it is relatively hardware-independent. Significant exceptions are made in Chapter 13 and Chapter 14. In practice, following a bottom-up approach is not as simple as it looks, because the areas of memory management, process management, and filesystems are intertwined; a few forward references—that is, references to topics yet to be explained—are unavoidable.
Each chapter starts with a theoretical overview of the topics covered. The material is then presented according to the bottom-up approach. We start with the data structures needed to support the functionalities described in the chapter. Then we usually move from the lowest level of functions to higher levels, often ending by showing how system calls issued by user applications are supported.
Linux source code for all supported architectures is contained in more than 14,000 C and assembly language files stored in about 1000 subdirectories; it consists of roughly 6 million lines of code, which occupy over 230 megabytes of disk space. Of course, this book can cover only a very small portion of that code. Just to figure out how big the Linux source is, consider that the whole source code of the book you are reading occupies less than 3 megabytes. Therefore, we would need more than 75 books like this to list all code, without even commenting on it!
So we had to make some choices about the parts to describe. This is a rough assessment of our decisions:
We describe process and memory management fairly thoroughly.
We cover the Virtual Filesystem and the Ext2 and Ext3 filesystems, although many functions are just mentioned without detailing the code; we do not discuss other filesystems supported by Linux.
We describe device drivers, which account for roughly 50% of the kernel, as far as the kernel interface is concerned, but do not attempt analysis of each specific driver.
The book describes the official 2.6.11 version of the Linux kernel, which can be downloaded from the web site http://www.kernel.org.
Be aware that most distributions of GNU/Linux modify the official kernel to implement new features or to improve its efficiency. In a few cases, the source code provided by your favorite distribution might differ significantly from the one described in this book.
In many cases, we show fragments of the original code rewritten in an easier-to-read but less efficient way. This occurs at time-critical points at which sections of programs are often written in a mixture of hand-optimized C and assembly code. Once again, our aim is to provide some help in studying the original Linux code.
While discussing kernel code, we often end up describing the underpinnings of many familiar features that Unix programmers have heard of and about which they may be curious (shared and mapped memory, signals, pipes, symbolic links, and so on).
To make life easier, Chapter 1, Introduction, presents a general picture of what is inside a Unix kernel and how Linux competes against other well-known Unix systems.
The heart of any Unix kernel is memory management. Chapter 2, Memory Addressing, explains how 80×86 processors include special circuits to address data in memory and how Linux exploits them.
Processes are a fundamental abstraction offered by Linux and are introduced in Chapter 3, Processes. Here we also explain how each process runs either in an unprivileged User Mode or in a privileged Kernel Mode. Transitions between User Mode and Kernel Mode happen only through well-established hardware mechanisms called interrupts and exceptions. These are introduced in Chapter 4, Interrupts and Exceptions.
On many occasions, the kernel has to deal with bursts of interrupt signals coming from different devices and processors. Synchronization mechanisms are needed so that all these requests can be serviced in an interleaved way by the kernel: they are discussed in Chapter 5, Kernel Synchronization, for both uniprocessor and multiprocessor systems.
One type of interrupt is crucial for allowing Linux to take care of elapsed time; further details can be found in Chapter 6, Timing Measurements.
Chapter 7, Process Scheduling, explains how Linux executes, in turn, every active process in the system so that all of them can progress toward their completions.
Next we focus again on memory. Chapter 8, Memory Management, describes the sophisticated techniques required to handle the most precious resource in the system (besides the processors, of course): available memory. This resource must be granted both to the Linux kernel and to the user applications. Chapter 9, Process Address Space, shows how the kernel copes with the requests for memory issued by greedy application programs.
Chapter 10, System Calls, explains how a process running in User Mode makes requests to the kernel, while Chapter 11, Signals, describes how a process may send synchronization signals to other processes. Now we are ready to move on to another essential topic, how Linux implements the filesystem. A series of chapters cover this topic. Chapter 12, The Virtual Filesystem, introduces a general layer that supports many different filesystems. Some Linux files are special because they provide trapdoors to reach hardware devices; Chapter 13, I/O Architecture and Device Drivers, and Chapter 14, Block Device Drivers, offer insights on these special files and on the corresponding hardware device drivers.
Another issue to consider is disk access time; Chapter 15, The Page Cache, shows how a clever use of RAM reduces disk accesses, therefore improving system performance significantly. Building on the material covered in these last chapters, we can now explain in Chapter 16, Accessing Files, how user applications access normal files. Chapter 17, Page Frame Reclaiming, completes our discussion of Linux memory management and explains the techniques used by Linux to ensure that enough memory is always available. The last chapter dealing with files is Chapter 18, The Ext2 and Ext3 Filesystems, which illustrates the most frequently used Linux filesystem, namely Ext2 and its recent evolution, Ext3.
The last two chapters end our detailed tour of the Linux kernel: Chapter 19, Process Communication, introduces communication mechanisms other than signals available to User Mode processes; Chapter 20, Program Execution, explains how user applications are started.
Last, but not least, are the appendixes: Appendix A, System Startup, sketches out how Linux is booted, while Appendix B, Modules, describes how to dynamically reconfigure the running kernel, adding and removing functionalities as needed. The Source Code Index includes all the Linux symbols referenced in the book; here you will find the name of the Linux file defining each symbol and the book's page number where it is explained. We think you'll find it quite handy.
The following is a list of typographical conventions used in this book:
Constant Width
Used to show the contents of code files or the output from commands, and to indicate source code keywords that appear in code.
Italic
Used for file and directory names, program and command names, command-line options, and URLs, and for emphasizing new terms.
Please address comments and questions concerning this book to the publisher:
| O'Reilly Media, Inc. |
| 1005 Gravenstein Highway North |
| Sebastopol, CA 95472 |
| (800) 998-9938 (in the United States or Canada) |
| (707) 829-0515 (international or local) |
| (707) 829-0104 (fax) |
We have a web page for this book, where we list errata, examples, or any additional information. You can access this page at:
| http://www.oreilly.com/catalog/understandlk/ |
To comment or ask technical questions about this book, send email to:
| bookquestions@oreilly.com |
For more information about our books, conferences, Resource Centers, and the O'Reilly Network, see our web site at:
| http://www.oreilly.com |
When you see a Safari® Enabled icon on the cover of your favorite technology book, it means the book is available online through the O'Reilly Network Safari Bookshelf.
Safari offers a solution that's better than e-books. It's a virtual library that lets you easily search thousands of top technology books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.
This book would not have been written without the precious help of the many students of the University of Rome school of engineering "Tor Vergata" who took our course and tried to decipher lecture notes about the Linux kernel. Their strenuous efforts to grasp the meaning of the source code led us to improve our presentation and correct many mistakes.
Andy Oram, our wonderful editor at O'Reilly Media, deserves a lot of credit. He was the first at O'Reilly to believe in this project, and he spent a lot of time and energy deciphering our preliminary drafts. He also suggested many ways to make the book more readable, and he wrote several excellent introductory paragraphs.
We had some prestigious reviewers who read our text quite carefully. The first edition was checked by (in alphabetical order by first name) Alan Cox, Michael Kerrisk, Paul Kinzelman, Raph Levien, and Rik van Riel.
The second edition was checked by Erez Zadok, Jerry Cooperstein, John Goerzen, Michael Kerrisk, Paul Kinzelman, Rik van Riel, and Walt Smith.
This edition has been reviewed by Charles P. Wright, Clemens Buchacher, Erez Zadok, Raphael Finkel, Rik van Riel, and Robert P. J. Day. Their comments, together with those of many readers from all over the world, helped us to remove several errors and inaccuracies and have made this book stronger.
—Marco Cesati
—Daniel P. Bovet
July 2005
Linux[*] is a member of the large family of Unix-like operating systems. A relative newcomer experiencing sudden spectacular popularity starting in the late 1990s, Linux joins such well-known commercial Unix operating systems as System V Release 4 (SVR4), developed by AT&T (now owned by the SCO Group); the 4.4 BSD release from the University of California at Berkeley (4.4BSD); Digital UNIX from Digital Equipment Corporation (now Hewlett-Packard); AIX from IBM; HP-UX from Hewlett-Packard; Solaris from Sun Microsystems; and Mac OS X from Apple Computer, Inc. Besides Linux, a few other open source Unix-like kernels exist, such as FreeBSD, NetBSD, and OpenBSD.
Linux was initially developed by Linus Torvalds in 1991 as an operating system for IBM-compatible personal computers based on the Intel 80386 microprocessor. Linus remains deeply involved with improving Linux, keeping it up-to-date with various hardware developments and coordinating the activity of hundreds of Linux developers around the world. Over the years, developers have worked to make Linux available on other architectures, including Hewlett-Packard's Alpha, Intel's Itanium, AMD's AMD64, PowerPC, and IBM's zSeries.
One of the more appealing benefits to Linux is that it isn't a commercial operating system: its source code under the GNU General Public License (GPL)[†] is open and available to anyone to study (as we will in this book); if you download the code (the official site is http://www.kernel.org) or check the sources on a Linux CD, you will be able to explore, from top to bottom, one of the most successful modern operating systems. This book, in fact, assumes you have the source code on hand and can apply what we say to your own explorations.
Technically speaking, Linux is a true Unix kernel, although it is not a full Unix operating system because it does not include all the Unix applications, such as filesystem utilities, windowing systems and graphical desktops, system administrator commands, text editors, compilers, and so on. However, because most of these programs are freely available under the GPL, they can be installed in every Linux-based system.
Because the Linux kernel requires so much additional software to provide a useful environment, many Linux users prefer to rely on commercial distributions, available on CD-ROM, to get the code included in a standard Unix system. Alternatively, the code may be obtained from several different sites, for instance http://www.kernel.org. Several distributions put the Linux source code in the /usr/src/linux directory. In the rest of this book, all file pathnames will refer implicitly to the Linux source code directory.
[†] The GNU project is coordinated by the Free Software Foundation, Inc. (http://www.gnu.org); its aim is to implement a whole operating system freely usable by everyone. The availability of a GNU C compiler has been essential for the success of the Linux project.
The various Unix-like systems on the market, some of which have a long history and show signs of archaic practices, differ in many important respects. All commercial variants were derived from either SVR4 or 4.4BSD, and all tend to agree on some common standards like IEEE's Portable Operating Systems based on Unix (POSIX) and X/Open's Common Applications Environment (CAE).
The current standards specify only an application programming interface (API)—that is, a well-defined environment in which user programs should run. Therefore, the standards do not impose any restriction on internal design choices of a compliant kernel.[*]
To define a common user interface, Unix-like kernels often share fundamental design ideas and features. In this respect, Linux is comparable with the other Unix-like operating systems. Reading this book and studying the Linux kernel, therefore, may help you understand the other Unix variants, too.
The 2.6 version of the Linux kernel aims to be compliant with the IEEE POSIX standard. This, of course, means that most existing Unix programs can be compiled and executed on a Linux system with very little effort or even without the need for patches to the source code. Moreover, Linux includes all the features of a modern Unix operating system, such as virtual memory, a virtual filesystem, lightweight processes, Unix signals, SVR4 interprocess communications, support for Symmetric Multiprocessor (SMP) systems, and so on.
When Linus Torvalds wrote the first kernel, he referred to some classical books on Unix internals, like Maurice Bach's The Design of the Unix Operating System (Prentice Hall, 1986). Actually, Linux still has some bias toward the Unix baseline described in Bach's book (i.e., SVR2). However, Linux doesn't stick to any particular variant. Instead, it tries to adopt the best features and design choices of several different Unix kernels.
The following list describes how Linux competes against some well-known commercial Unix kernels:
The Linux kernel is monolithic: a large, complex do-it-yourself program, composed of several logically different components. In this, it is quite conventional; most commercial Unix variants are monolithic. (Notable exceptions are the Apple Mac OS X and the GNU Hurd operating systems, both derived from Carnegie Mellon's Mach, which follow a microkernel approach.)
Most modern kernels can dynamically load and unload some portions of the kernel code (typically, device drivers), which are usually called modules. Linux's support for modules is very good, because it is able to automatically load and unload modules on demand. Among the main commercial Unix variants, only the SVR4.2 and Solaris kernels have a similar feature.
Some Unix kernels, such as Solaris and SVR4.2/MP, are organized as a set of kernel threads. A kernel thread is an execution context that can be independently scheduled; it may be associated with a user program, or it may run only some kernel functions. Context switches between kernel threads are usually much less expensive than context switches between ordinary processes, because the former usually operate on a common address space. Linux uses kernel threads in a very limited way to execute a few kernel functions periodically; however, they do not represent the basic execution context abstraction. (That's the topic of the next item.)
Most modern operating systems have some kind of support for multithreaded applications — that is, user programs that are designed in terms of many relatively independent execution flows that share a large portion of the application data structures. A multithreaded user application could be composed of many lightweight processes (LWP), which are processes that can operate on a common address space, common physical memory pages, common opened files, and so on. Linux defines its own version of lightweight processes, which is different from the types used on other systems such as SVR4 and Solaris. While all the commercial Unix variants of LWP are based on kernel threads, Linux regards lightweight processes as the basic execution context and handles them via the nonstandard clone( ) system call.
When compiled with the "Preemptible Kernel" option, Linux 2.6 can arbitrarily interleave execution flows while they are in privileged mode. Besides Linux 2.6, a few other conventional, general-purpose Unix systems, such as Solaris and Mach 3.0, are fully preemptive kernels. SVR4.2/MP introduces some fixed preemption points as a method to get limited preemption capability.
Several Unix kernel variants take advantage of multiprocessor systems. Linux 2.6 supports symmetric multiprocessing (SMP) for different memory models, including NUMA: the system can use multiple processors and each processor can handle any task — there is no discrimination among them. Although a few parts of the kernel code are still serialized by means of a single "big kernel lock," it is fair to say that Linux 2.6 makes a near optimal use of SMP.
Linux's standard filesystems come in many flavors. You can use the plain old Ext2 filesystem if you don't have specific needs. You might switch to Ext3 if you want to avoid lengthy filesystem checks after a system crash. If you'll have to deal with many small files, the ReiserFS filesystem is likely to be the best choice. Besides Ext3 and ReiserFS, several other journaling filesystems can be used in Linux; they include IBM AIX's Journaling File System (JFS) and Silicon Graphics IRIX's XFS filesystem. Thanks to a powerful object-oriented Virtual File System technology (inspired by Solaris and SVR4), porting a foreign filesystem to Linux is generally easier than porting to other kernels.
Linux has no analog to the STREAMS I/O subsystem introduced in SVR4, although it is included now in most Unix kernels and has become the preferred interface for writing device drivers, terminal drivers, and network protocols.
This assessment suggests that Linux is fully competitive nowadays with commercial operating systems. Moreover, Linux has several features that make it an exciting operating system. Commercial Unix kernels often introduce new features to gain a larger slice of the market, but these features are not necessarily useful, stable, or productive. As a matter of fact, modern Unix kernels tend to be quite bloated. By contrast, Linux—together with the other open source operating systems—doesn't suffer from the restrictions and the conditioning imposed by the market, hence it can freely evolve according to the ideas of its designers (mainly Linus Torvalds). Specifically, Linux offers the following advantages over its commercial competitors:
You can install a complete Unix system at no expense other than the hardware (of course).
Thanks to the compilation options, you can customize the kernel by selecting only the features really needed. Moreover, thanks to the GPL, you are allowed to freely read and modify the source code of the kernel and of all system programs.[*]
You are able to build a network server using an old Intel 80386 system with 4 MB of RAM.
Linux systems are very fast, because they fully exploit the features of the hardware components. The main Linux goal is efficiency, and indeed many design choices of commercial variants, like the STREAMS I/O subsystem, have been rejected by Linus because of their implied performance penalty.
Linux systems are very stable; they have a very low failure rate and system maintenance time.
It is possible to fit a kernel image, including a few system programs, on just one 1.44 MB floppy disk. As far as we know, none of the commercial Unix variants is able to boot from a single floppy disk.
Linux lets you directly mount filesystems for all versions of MS-DOS and Microsoft Windows, SVR4, OS/2, Mac OS X, Solaris, SunOS, NEXTSTEP, many BSD variants, and so on. Linux also is able to operate with many network layers, such as Ethernet (as well as Fast Ethernet, Gigabit Ethernet, and 10 Gigabit Ethernet), Fiber Distributed Data Interface (FDDI), High Performance Parallel Interface (HIPPI), IEEE 802.11 (Wireless LAN), and IEEE 802.15 (Bluetooth). By using suitable libraries, Linux systems are even able to directly run programs written for other operating systems. For example, Linux is able to execute some applications written for MS-DOS, Microsoft Windows, SVR3 and R4, 4.4BSD, SCO Unix, Xenix, and others on the 80x86 platform.
Believe it or not, it may be a lot easier to get patches and updates for Linux than for any proprietary operating system. The answer to a problem often comes back within a few hours after sending a message to some newsgroup or mailing list. Moreover, drivers for Linux are usually available a few weeks after new hardware products have been introduced on the market. By contrast, hardware manufacturers release device drivers for only a few commercial operating systems — usually Microsoft's. Therefore, all commercial Unix variants run on a restricted subset of hardware components.
With an estimated installed base of several tens of millions, people who are used to certain features that are standard under other operating systems are starting to expect the same from Linux. In that regard, the demand on Linux developers is also increasing. Luckily, though, Linux has evolved under the close direction of Linus and his subsystem maintainers to accommodate the needs of the masses.
Linux tries to maintain a neat distinction between hardware-dependent and hardware-independent source code. To that end, both the arch and the include directories contain 23 subdirectories that correspond to the different types of hardware platforms supported. The standard names of the platforms are:
Hewlett-Packard's Alpha workstations (originally Digital, then Compaq; no longer manufactured)
ARM processor-based computers such as PDAs and embedded devices
"Code Reduced Instruction Set" CPUs used by Axis in its thin-servers, such as web cameras or development boards
Embedded systems based on microprocessors of Fujitsu's FR-V family
Hitachi h8/300 and h8S RISC 8/16-bit microprocessors
IBM-compatible personal computers based on 80x86 microprocessors
Workstations based on the Intel 64-bit Itanium microprocessor
Computers based on the Renesas M32R family of microprocessors
Personal computers based on Motorola MC680x0 microprocessors
Workstations based on MIPS microprocessors, such as those marketed by Silicon Graphics
Workstations based on Hewlett Packard HP 9000 PA-RISC microprocessors
Workstations based on the 32-bit and 64-bit Motorola-IBM PowerPC microprocessors
IBM ESA/390 and zSeries mainframes
Embedded systems based on SuperH microprocessors developed by Hitachi and STMicroelectronics
Workstations based on Sun Microsystems SPARC and 64-bit Ultra SPARC microprocessors
User Mode Linux, a virtual platform that allows developers to run a kernel in User Mode
NEC V850 microcontrollers that incorporate a 32-bit RISC core based on the Harvard architecture
Workstations based on AMD's 64-bit microprocessors, such as the Athlon and Opteron, and on Intel's ia32e/EM64T 64-bit microprocessors
Up to kernel version 2.5, Linux identified kernels through a simple numbering scheme. Each version was characterized by three numbers, separated by periods. The first two numbers were used to identify the version; the third number identified the release. The first version number, namely 2, has stayed unchanged since 1996. The second version number identified the type of kernel: if it was even, it denoted a stable version; otherwise, it denoted a development version.
As the name suggests, stable versions were thoroughly checked by Linux distributors and kernel hackers. A new stable version was released only to address bugs and to add new device drivers. Development versions, on the other hand, differed quite significantly from one another; kernel developers were free to experiment with different solutions that occasionally led to drastic kernel changes. Users who relied on development versions for running applications could experience unpleasant surprises when upgrading their kernel to a newer release.
During development of Linux kernel version 2.6, however, a significant change in the version numbering scheme took place. Basically, the second number no longer identifies stable or development versions; thus, nowadays kernel developers introduce large and significant changes in the current kernel version 2.6. A new kernel 2.7 branch will be created only when kernel developers have to test a really disruptive change; this 2.7 branch will lead to a new current kernel version, or it will be backported to the 2.6 version, or finally it will simply be dropped as a dead end.
The new model of Linux development implies that two kernels having the same version but different release numbers—for instance, 2.6.10 and 2.6.11—can differ significantly even in core components and in fundamental algorithms. Thus, when a new kernel release appears, it is potentially unstable and buggy. To address this problem, the kernel developers may release patched versions of any kernel, which are identified by a fourth number in the version numbering scheme. For instance, at the time this paragraph was written, the latest "stable" kernel version was 2.6.11.12.
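As a rough illustration of this numbering scheme, the kernel itself packs the three numbers into a single integer (this is what the KERNEL_VERSION() macro in include/linux/version.h does), so that version checks reduce to plain integer comparisons. A minimal sketch in Python:

```python
# Encode a kernel version triple the way KERNEL_VERSION() does:
# (major << 16) + (minor << 8) + release.

def kernel_version(major, minor, release):
    """Pack a version triple into one comparable integer."""
    return (major << 16) + (minor << 8) + release

# Linux 2.6.11, the version covered by this book:
code_2_6_11 = kernel_version(2, 6, 11)

# Releases order naturally under integer comparison:
assert kernel_version(2, 6, 10) < code_2_6_11 < kernel_version(2, 7, 0)
```

The fourth ("stable patch") number, such as the 12 in 2.6.11.12, is not part of this encoding; it was bolted on later precisely because the three-number scheme had no room for it.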
Please be aware that the kernel version described in this book is Linux 2.6.11.
Each computer system includes a basic set of programs called the operating system. The most important program in the set is called the kernel. It is loaded into RAM when the system boots and contains many critical procedures that are needed for the system to operate. The other programs are less crucial utilities; they can provide a wide variety of interactive experiences for the user—as well as doing all the jobs the user bought the computer for—but the essential shape and capabilities of the system are determined by the kernel. The kernel provides key facilities to everything else on the system and determines many of the characteristics of higher software. Hence, we often use the term "operating system" as a synonym for "kernel."
The operating system must fulfill two main objectives:
Interact with the hardware components, servicing all low-level programmable elements included in the hardware platform.
Provide an execution environment to the applications that run on the computer system (the so-called user programs).
Some operating systems allow all user programs to directly play with the hardware components (a typical example is MS-DOS ). In contrast, a Unix-like operating system hides all low-level details concerning the physical organization of the computer from applications run by the user. When a program wants to use a hardware resource, it must issue a request to the operating system. The kernel evaluates the request and, if it chooses to grant the resource, interacts with the proper hardware components on behalf of the user program.
To enforce this mechanism, modern operating systems rely on the availability of specific hardware features that forbid user programs to directly interact with low-level hardware components or to access arbitrary memory locations. In particular, the hardware introduces at least two different execution modes for the CPU: a nonprivileged mode for user programs and a privileged mode for the kernel. Unix calls these User Mode and Kernel Mode, respectively.
In the rest of this chapter, we introduce the basic concepts that have motivated the design of Unix over the past two decades, as well as Linux and other operating systems. While the concepts are probably familiar to you as a Linux user, these sections try to delve into them a bit more deeply than usual to explain the requirements they place on an operating system kernel. These broad considerations refer to virtually all Unix-like systems. The other chapters of this book will hopefully help you understand the Linux kernel internals.
A multiuser system is a computer that is able to concurrently and independently execute several applications belonging to two or more users. Concurrently means that applications can be active at the same time and contend for the various resources such as CPU, memory, hard disks, and so on. Independently means that each application can perform its task with no concern for what the applications of the other users are doing. Switching from one application to another, of course, slows down each of them and affects the response time seen by the users. Many of the complexities of modern operating system kernels, which we will examine in this book, are present to minimize the delays enforced on each program and to provide the user with responses that are as fast as possible.
Multiuser operating systems must include several features:
An authentication mechanism for verifying the user's identity
A protection mechanism against buggy user programs that could block other applications running in the system
A protection mechanism against malicious user programs that could interfere with or spy on the activity of other users
An accounting mechanism that limits the amount of resource units assigned to each user
To ensure safe protection mechanisms, operating systems must use the hardware protection associated with the CPU privileged mode. Otherwise, a user program would be able to directly access the system circuitry and overcome the imposed bounds. Unix is a multiuser system that enforces the hardware protection of system resources.
In a multiuser system, each user has a private space on the machine; typically, he owns some quota of the disk space to store files, receives private mail messages, and so on. The operating system must ensure that the private portion of a user space is visible only to its owner. In particular, it must ensure that no user can exploit a system application for the purpose of violating the private space of another user.
All users are identified by a unique number called the User ID, or UID. Usually only a restricted number of persons are allowed to make use of a computer system. When one of these users starts a working session, the system asks for a login name and a password. If the user does not input a valid pair, the system denies access. Because the password is assumed to be secret, the user's privacy is ensured.
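The numeric identity just described is easy to inspect from a program. The following sketch (assuming a POSIX system; the user database lookup mirrors what login itself consults) queries the current UID and the matching login record:

```python
import os
import pwd

# The kernel identifies users only by the numeric UID; the login name
# lives in the user database (/etc/passwd on a classic system).
uid = os.getuid()
try:
    login_name = pwd.getpwuid(uid).pw_name  # may be absent in minimal chroots
except KeyError:
    login_name = None
```

Note that the password check happens entirely in user space (in the login program); the kernel only ever sees the UID.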
To selectively share material with other users, each user is a member of one or more user groups, which are identified by a unique number called a user group ID. Each file is associated with exactly one group. For example, access can be set so the user owning the file has read and write privileges, the group has read-only privileges, and other users on the system are denied access to the file.
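The owner/group/others split above corresponds to three octal triads in the file mode; the example (owner read/write, group read-only, others denied) is mode 0o640. A small sketch, using a scratch file purely for illustration:

```python
import os
import stat
import tempfile

# Owner gets read/write, group read-only, others nothing: rw-r----- (0o640).
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o640)

mode = stat.S_IMODE(os.stat(path).st_mode)
# The nine permission bits split into three triads: owner, group, others.
owner_bits = (mode >> 6) & 0o7   # 0o6 -> rw-
group_bits = (mode >> 3) & 0o7   # 0o4 -> r--
other_bits = mode & 0o7          # 0o0 -> ---
os.unlink(path)
```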
Any Unix-like operating system has a special user called root or superuser. The system administrator must log in as root to handle user accounts, perform maintenance tasks such as system backups and program upgrades, and so on. The root user can do almost everything, because the operating system does not apply the usual protection mechanisms to her. In particular, the root user can access every file on the system and can manipulate every running user program.
All operating systems use one fundamental abstraction: the process. A process can be defined either as "an instance of a program in execution" or as the "execution context" of a running program. In traditional operating systems, a process executes a single sequence of instructions in an address space; the address space is the set of memory addresses that the process is allowed to reference. Modern operating systems allow processes with multiple execution flows — that is, multiple sequences of instructions executed in the same address space.
Multiuser systems must enforce an execution environment in which several processes can be active concurrently and contend for system resources, mainly the CPU. Systems that allow concurrent active processes are said to be multiprogramming or multiprocessing.[*] It is important to distinguish programs from processes; several processes can execute the same program concurrently, while the same process can execute several programs sequentially.
On uniprocessor systems, just one process can hold the CPU, and hence just one execution flow can progress at a time. In general, the number of CPUs is always restricted, and therefore only a few processes can progress at once. An operating system component called the scheduler chooses the process that can progress. Some operating systems allow only nonpreemptable processes, which means that the scheduler is invoked only when a process voluntarily relinquishes the CPU. But processes of a multiuser system must be preemptable; the operating system tracks how long each process holds the CPU and periodically activates the scheduler.
Unix is a multiprocessing operating system with preemptable processes. Even when no user is logged in and no application is running, several system processes monitor the peripheral devices. In particular, several processes listen at the system terminals waiting for user logins. When a user inputs a login name, the listening process runs a program that validates the user password. If the user identity is acknowledged, the process creates another process that runs a shell into which commands are entered. When a graphical display is activated, one process runs the window manager, and each window on the display is usually run by a separate process. When a user creates a graphics shell, one process runs the graphics windows and a second process runs the shell into which the user can enter the commands. For each user command, the shell process creates another process that executes the corresponding program.
Unix-like operating systems adopt a process/kernel model . Each process has the illusion that it's the only process on the machine, and it has exclusive access to the operating system services. Whenever a process makes a system call (i.e., a request to the kernel, see Chapter 10), the hardware changes the privilege mode from User Mode to Kernel Mode, and the process starts the execution of a kernel procedure with a strictly limited purpose. In this way, the operating system acts within the execution context of the process in order to satisfy its request. Whenever the request is fully satisfied, the kernel procedure forces the hardware to return to User Mode and the process continues its execution from the instruction following the system call.
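The system-call boundary is normally hidden behind C library wrappers. As a rough illustration (assuming a Linux system whose C library is already loaded into the process), the same getpid() service can be reached either through Python's os module or by calling the libc wrapper directly; both paths end at the same kernel entry, and the User Mode/Kernel Mode switch happens inside the wrapper:

```python
import ctypes
import os

# Obtain a handle to the already-loaded C library of this process.
libc = ctypes.CDLL(None)

# Two routes to the same system call:
pid_from_libc = libc.getpid()     # libc wrapper -> getpid() system call
pid_from_python = os.getpid()     # Python's own wrapper around the same call
```

Since both calls ask the kernel for the PID of the current process, they must agree.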
As stated before, most Unix kernels are monolithic: each kernel layer is integrated into the whole kernel program and runs in Kernel Mode on behalf of the current process. In contrast, microkernel operating systems demand a very small set of functions from the kernel, generally including a few synchronization primitives, a simple scheduler, and an interprocess communication mechanism. Several system processes that run on top of the microkernel implement other operating system-layer functions, like memory allocators, device drivers, and system call handlers.
Although academic research on operating systems is oriented toward microkernels, such operating systems are generally slower than monolithic ones, because the explicit message passing between the different layers of the operating system has a cost. However, microkernel operating systems might have some theoretical advantages over monolithic ones. Microkernels force the system programmers to adopt a modularized approach, because each operating system layer is a relatively independent program that must interact with the other layers through well-defined and clean software interfaces. Moreover, an existing microkernel operating system can be ported to other architectures fairly easily, because all hardware-dependent components are generally encapsulated in the microkernel code. Finally, microkernel operating systems tend to make better use of random access memory (RAM) than monolithic ones, because system processes that aren't implementing needed functionalities might be swapped out or destroyed.
To achieve many of the theoretical advantages of microkernels without introducing performance penalties, the Linux kernel offers modules. A module is an object file whose code can be linked to (and unlinked from) the kernel at runtime. The object code usually consists of a set of functions that implements a filesystem, a device driver, or other features at the kernel's upper layer. The module, unlike the external layers of microkernel operating systems, does not run as a specific process. Instead, it is executed in Kernel Mode on behalf of the current process, like any other statically linked kernel function.
The main advantages of using modules include:
Because any module can be linked and unlinked at runtime, system programmers must introduce well-defined software interfaces to access the data structures handled by modules. This makes it easy to develop new modules.
Even if it may rely on some specific hardware features, a module doesn't depend on a fixed hardware platform. For example, a disk driver module that relies on the SCSI standard works as well on an IBM-compatible PC as it does on Hewlett-Packard's Alpha.
A module can be linked to the running kernel when its functionality is required and unlinked when it is no longer useful; this is quite useful for small embedded systems.
Once linked in, the object code of a module is equivalent to the object code of the statically linked kernel. Therefore, no explicit message passing is required when the functions of the module are invoked.[*]
The Unix operating system design is centered on its filesystem, which has several interesting characteristics. We'll review the most significant ones, since they will be mentioned quite often in forthcoming chapters.
A Unix file is an information container structured as a sequence of bytes; the kernel does not interpret the contents of a file. Many programming libraries implement higher-level abstractions, such as records structured into fields and record addressing based on keys. However, the programs in these libraries must rely on system calls offered by the kernel. From the user's point of view, files are organized in a tree-structured namespace, as shown in Figure 1-1.
All the nodes of the tree, except the leaves, denote directory names. A directory node contains information about the files and directories just beneath it. A file or directory name consists of a sequence of arbitrary ASCII characters,[*] with the exception of / and of the null character \0. Most filesystems place a limit on the length of a filename, typically no more than 255 characters. The directory corresponding to the root of the tree is called the root directory. By convention, its name is a slash (/). Names must be different within the same directory, but the same name may be used in different directories.
Unix associates a current working directory with each process (see the section "The Process/Kernel Model" later in this chapter); it belongs to the process execution context, and it identifies the directory currently used by the process. To identify a specific file, the process uses a pathname, which consists of slashes alternating with a sequence of directory names that lead to the file. If the first item in the pathname is a slash, the pathname is said to be absolute, because its starting point is the root directory. Otherwise, if the first item is a directory name or filename, the pathname is said to be relative, because its starting point is the process's current directory.
While specifying filenames, the notations "." and ".." are also used. They denote the current working directory and its parent directory, respectively. If the current working directory is the root directory, "." and ".." coincide.
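These pathname conventions can be observed directly. A small sketch, using Python's os module as a stand-in for the underlying getcwd() and realpath() library calls:

```python
import os

# An absolute pathname starts at the root directory; a relative one
# starts at the process's current working directory.
cwd = os.getcwd()
cwd_is_absolute = os.path.isabs(cwd)   # getcwd() always returns an absolute path

# "." names the current working directory itself:
dot_resolves_to_cwd = (os.path.realpath(".") == os.path.realpath(cwd))

# In the root directory, "." and ".." coincide, so /.. resolves to / itself:
root_parent = os.path.realpath("/..")
```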
A filename included in a directory is called a file hard link, or more simply, a link. The same file may have several links included in the same directory or in different ones, so it may have several filenames.
The Unix command:
$ ln p1 p2
is used to create a new hard link that has the pathname p2 for a file identified by the pathname p1.
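The effect of ln can be reproduced with the underlying link() system call. In this sketch (the names p1 and p2 mirror the command above; the temporary directory is just scaffolding), both names resolve to the same inode and the link count rises to two:

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
p1 = os.path.join(tmp, "p1")
p2 = os.path.join(tmp, "p2")

with open(p1, "w") as f:
    f.write("hello")

os.link(p1, p2)                          # equivalent of: ln p1 p2

s1, s2 = os.stat(p1), os.stat(p2)
same_inode = (s1.st_ino == s2.st_ino)    # both names refer to one inode
link_count = s1.st_nlink                 # now 2: two directory entries

# Removing one name leaves the file reachable through the other:
os.unlink(p1)
with open(p2) as f:
    still_there = f.read()
os.unlink(p2)
os.rmdir(tmp)
```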
Hard links have two limitations:
It is not possible to create hard links for directories. Doing so might transform the directory tree into a graph with cycles, thus making it impossible to locate a file according to its name.
Links can be created only among files included in the same filesystem. This is a serious limitation, because modern Unix systems may include several filesystems located on different disks and/or partitions, and users may be unaware of the physical divisions between them.
To overcome these limitations, soft links (also called symbolic links) were introduced a long time ago. Symbolic links are short files that contain an arbitrary pathname of another file. The pathname may refer to any file or directory located in any filesystem; it may even refer to a nonexistent file.
The Unix command:
$ ln -s p1 p2
creates a new soft link with pathname p2 that refers to pathname p1. When this command is executed, the filesystem extracts the directory part of p2 and creates a new entry in that directory of type symbolic link, with the name indicated by p2. This new file contains the name indicated by pathname p1. This way, each reference to p2 can be translated automatically into a reference to p1.
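The behavior of ln -s can likewise be sketched with the symlink() and readlink() calls (the names p1 and p2 mirror the command above). Note that, as stated in the text, the link may legitimately point to a file that does not yet exist:

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
p1 = os.path.join(tmp, "p1")
p2 = os.path.join(tmp, "p2")

os.symlink(p1, p2)                   # equivalent of: ln -s p1 p2 (p1 need not exist)
target = os.readlink(p2)             # the link merely stores the pathname p1
dangling = not os.path.exists(p2)    # exists() follows the link; target is missing

with open(p1, "w") as f:             # create the target; p2 now resolves
    f.write("data")
with open(p2) as f:
    resolved = f.read()              # every reference to p2 reaches p1

is_link = os.path.islink(p2)         # lstat() sees the link itself, not the target
os.unlink(p2)
os.unlink(p1)
os.rmdir(tmp)
```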
Unix files may have one of the following types:
Regular file
Directory
Symbolic link
Block-oriented device file
Character-oriented device file
Pipe and named pipe (also called FIFO)
Socket
The first three file types are constituents of any Unix filesystem. Their implementation is described in detail in Chapter 18.
Device files are related both to I/O devices and to device drivers integrated into the kernel. For example, when a program accesses a device file, it acts directly on the I/O device associated with that file (see Chapter 13).
Pipes and sockets are special files used for interprocess communication (see the section "Synchronization and Critical Regions" later in this chapter; also see Chapter 19).
Unix makes a clear distinction between the contents of a file and the information about a file. With the exception of device files and files of special filesystems, each file consists of a sequence of bytes. The file does not include any control information, such as its length or an end-of-file (EOF) delimiter.
All information needed by the filesystem to handle a file is included in a data structure called an inode. Each file has its own inode, which the filesystem uses to identify the file.
While filesystems and the kernel functions handling them can vary widely from one Unix system to another, they must always provide at least the following attributes, which are specified in the POSIX standard:
File type (see the previous section)
Number of hard links associated with the file
File length in bytes
Device ID (i.e., an identifier of the device containing the file)
Inode number that identifies the file within the filesystem
User group ID of the file
Several timestamps that specify the inode status change time, the last access time, and the last modify time
Access rights and file mode (see the next section)
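Most of the POSIX attributes listed above are visible through the stat() system call. A short sketch (the scratch file is purely illustrative):

```python
import os
import stat
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"12345")
os.close(fd)

st = os.stat(path)           # stat() exposes the inode attributes
file_type_is_regular = stat.S_ISREG(st.st_mode)       # file type
hard_links = st.st_nlink     # number of hard links
length = st.st_size          # file length in bytes
device = st.st_dev           # ID of the device containing the file
inode_number = st.st_ino     # inode number within the filesystem
group = st.st_gid            # user group ID of the file
timestamps = (st.st_atime, st.st_mtime, st.st_ctime)  # access/modify/change
os.unlink(path)
```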
文件的潜在用户分为三类:
The potential users of a file fall into three classes:
作为文件所有者的用户
The user who is the owner of the file
与文件属于同一组的用户,不包括所有者
The users who belong to the same group as the file, not including the owner
所有剩余用户(其他)
All remaining users (others)
这三个类中的每一个都具有三种类型的访问权限——读、写和执行。因此,与文件关联的访问权限集由九个不同的二进制标志组成。三个附加标志,称为 suid(设置用户 ID)、sgid(设置组 ID)和 sticky,定义文件模式。当应用于可执行文件时,这些标志具有以下含义:
There are three types of access rights — read, write, and execute — for each of these three classes. Thus, the set of access rights associated with a file consists of nine different binary flags. Three additional flags, called suid (Set User ID), sgid (Set Group ID), and sticky, define the file mode. These flags have the following meanings when applied to executable files:
suid
执行文件的进程通常会保留进程所有者的用户 ID(UID)。但是,如果可执行文件设置了 suid 标志,则进程将获取文件所有者的 UID。
A process executing a file normally keeps the User ID (UID) of the process owner. However, if the executable file has the suid flag set, the process gets the UID of the file owner.
sgid
执行文件的进程通常保留进程组的用户组 ID。但是,如果可执行文件设置了 sgid 标志,则进程将获取文件的用户组 ID。
A process executing a file keeps the user group ID of the process group. However, if the executable file has the sgid flag set, the process gets the user group ID of the file.
sticky
设置了 sticky 标志的可执行文件对应于对内核的一个请求:在程序执行终止后将该程序保留在内存中。[*]
An executable file with the sticky flag set corresponds to a request to the kernel to keep the program in memory after its execution terminates.[*]
当进程创建文件时,其所有者 ID 就是该进程的 UID。其所有者用户组 ID 可以是创建者进程的进程组 ID,也可以是父目录的用户组 ID,具体取决于父目录的 sgid 标志值。
When a file is created by a process, its owner ID is the UID of
the process. Its owner user group ID can be either the process group
ID of the creator process or the user group ID of the parent
directory, depending on the value of the sgid flag of the parent directory.
当用户访问常规文件或目录的内容时,他实际上访问了存储在硬件块设备中的一些数据。从这个意义上说,文件系统是硬盘分区物理组织的用户级视图。由于用户模式下的进程无法直接与低级硬件组件交互,因此每个实际的文件操作都必须在内核模式下执行。因此,Unix操作系统定义了几个与文件处理相关的系统调用。
When a user accesses the contents of either a regular file or a directory, he actually accesses some data stored in a hardware block device. In this sense, a filesystem is a user-level view of the physical organization of a hard disk partition. Because a process in User Mode cannot directly interact with the low-level hardware components, each actual file operation must be performed in Kernel Mode. Therefore, the Unix operating system defines several system calls related to file handling.
所有 Unix 内核都非常注重硬件块设备的有效处理,以实现良好的整体系统性能。在接下来的章节中,我们将描述与 Linux 中的文件处理相关的主题,特别是内核如何对与文件相关的系统调用做出反应。要理解这些描述,您需要了解主要文件处理系统调用的使用方式;这些将在下一节中描述。
All Unix kernels devote great attention to the efficient handling of hardware block devices to achieve good overall system performance. In the chapters that follow, we will describe topics related to file handling in Linux and specifically how the kernel reacts to file-related system calls. To understand those descriptions, you will need to know how the main file-handling system calls are used; these are described in the next section.
进程只能访问“打开”的文件。要打开文件,该进程调用系统调用:
Processes can access only "opened" files. To open a file, the process invokes the system call:
fd = open(path, flag, mode)
三个参数的含义如下:
The three parameters have the following meanings:
path
表示要打开的文件的路径名(相对或绝对)。
Denotes the pathname (relative or absolute) of the file to be opened.
flag
指定必须如何打开文件(例如,读、写、读/写、附加)。它还可以指定是否应创建不存在的文件。
Specifies how the file must be opened (e.g., read, write, read/write, append). It also can specify whether a nonexisting file should be created.
mode
指定新创建的文件的访问权限。
Specifies the access rights of a newly created file.
该系统调用创建一个“打开文件”对象并返回一个称为文件描述符的标识符。一个打开的文件对象包含:
This system call creates an "open file" object and returns an identifier called a file descriptor. An open file object contains:
一些文件处理数据结构,例如一组指定如何打开文件的标志,一个表示文件中下一次操作将发生的当前位置的 offset 字段(所谓的文件指针),等等。
Some file-handling data structures, such as a set of flags
specifying how the file has been opened, an offset field that denotes the current
position in the file from which the next operation will take
place (the so-called file pointer), and so
on.
一些指向进程可以调用的内核函数的指针。允许的函数集取决于 flag 参数的值。
Some pointers to kernel functions that the process can
invoke. The set of permitted functions depends on the value of
the flag parameter.
我们将在第 12 章中详细讨论打开的文件对象。我们在这里只描述 POSIX 语义指定的一些通用属性。
We discuss open file objects in detail in Chapter 12. Let's limit ourselves here to describing some general properties specified by the POSIX semantics.
文件描述符表示进程和打开的文件之间的交互,而打开的文件对象包含与该交互相关的数据。同一个打开的文件对象可能由同一进程中的多个文件描述符标识。
A file descriptor represents an interaction between a process and an opened file, while an open file object contains data related to that interaction. The same open file object may be identified by several file descriptors in the same process.
多个进程可能同时打开同一个文件。在这种情况下,文件系统为每个进程分配一个单独的文件描述符以及一个单独的打开文件对象。发生这种情况时,Unix 文件系统不会在这些进程对同一文件发出的 I/O 操作之间提供任何类型的同步。然而,有几个系统调用(例如 flock( ))可以让进程在整个文件或部分文件上进行自身同步(参见第 12 章)。
Several processes may concurrently open the same file. In this case, the filesystem assigns a separate file descriptor to each of these processes, along with a separate open file object. When this occurs, the Unix filesystem does not provide any kind of synchronization among the I/O operations issued by the processes on the same file. However, several system calls such as flock( ) are available to allow processes to synchronize themselves on the entire file or on portions of it (see Chapter 12).
为了创建一个新文件,进程还可以调用 creat( ) 系统调用,内核处理它的方式与 open( ) 完全相同。
To create a new file, the process also may invoke the creat( ) system call, which is handled by the kernel exactly like open( ).
常规 Unix 文件可以按顺序或随机方式寻址,而设备文件和命名管道通常按顺序访问。在这两种访问中,内核都会将文件指针存储在打开的文件对象中,即下一次读或写操作将发生的当前位置。
Regular Unix files can be addressed either sequentially or randomly, while device files and named pipes are usually accessed sequentially. In both kinds of access, the kernel stores the file pointer in the open file object — that is, the current position at which the next read or write operation will take place.
隐含地假定顺序访问:read( )和write( ) 系统调用始终引用当前文件指针的位置。要修改该值,程序必须显式调用lseek( ) 系统调用。当一个文件被打开时,内核将文件指针设置为文件中第一个字节的位置(偏移量0)。
Sequential access is implicitly assumed: the read( ) and write( ) system calls always refer to the position of the
current file pointer. To modify the value, a program must explicitly
invoke the lseek( ) system call. When a file is opened, the kernel sets
the file pointer to the position of the first byte in the file
(offset 0).
lseek( ) 系统调用需要以下参数:
The lseek( ) system call
requires the following parameters:
newoffset = lseek(fd, offset, whence);
其含义如下:
which have the following meanings:
fd
表示打开的文件的文件描述符
Indicates the file descriptor of the opened file
offset
指定一个有符号整数值,用于计算文件指针的新位置
Specifies a signed integer value that will be used for computing the new position of the file pointer
whence
指定应通过将 offset 值与数字 0(距文件开头的偏移量)、当前文件指针,还是最后一个字节的位置(距文件末尾的偏移量)相加来计算新位置
Specifies whether the new position should be computed by
adding the offset value to
the number 0 (offset from the beginning of the file), the
current file pointer, or the position of the last byte (offset
from the end of the file)
read( ) 系统调用需要以下参数:
The read( ) system call requires the following parameters:
nread = read(fd, buf, count);
其含义如下:
which have the following meanings:
fd
表示打开的文件的文件描述符
Indicates the file descriptor of the opened file
buf
指定进程地址空间中数据将传输到的缓冲区的地址
Specifies the address of the buffer in the process's address space to which the data will be transferred
count
表示要读取的字节数
Denotes the number of bytes to read
当处理这样的系统调用时,内核尝试从文件描述符为 fd 的文件中读取 count 字节,起始位置为打开文件的 offset 字段的当前值。在某些情况下(文件结尾、空管道等),内核无法成功读取全部 count 字节。返回值 nread 指定实际读取的字节数。文件指针也会通过加上 nread 来更新。write( ) 的参数与此类似。
When handling such a system call, the kernel attempts to read
count bytes from the file having
the file descriptor fd, starting
from the current value of the opened file's offset field. In some
cases—end-of-file, empty pipe, and so on—the kernel does not succeed
in reading all count bytes. The
returned nread value specifies
the number of bytes effectively read. The file pointer also is
updated by adding nread to its
previous value. The write( )
parameters are similar.
当进程不再需要访问文件的内容时,它可以调用系统调用:
When a process does not need to access the contents of a file anymore, it can invoke the system call:
res = close(fd);
它释放与文件描述符 fd 对应的打开文件对象。当进程终止时,内核会关闭其所有仍然打开的文件。
which releases the open file object corresponding to the file
descriptor fd. When a process
terminates, the kernel closes all its remaining opened files.
要重命名或删除文件,进程不需要打开它。事实上,此类操作并不作用于受影响文件的内容,而是作用于一个或多个目录的内容。例如系统调用:
To rename or delete a file, a process does not need to open it. Indeed, such operations do not act on the contents of the affected file, but rather on the contents of one or more directories. For example, the system call:
res = rename(oldpath, newpath);
更改文件链接的名称,而系统调用:
changes the name of a file link, while the system call:
res = unlink(pathname);
减少文件链接计数并删除相应的目录条目。仅当链接计数为 0 时,才会删除该文件。
decreases the file link count and removes the corresponding directory entry. The file is deleted only when the link count assumes the value 0.
Unix 内核提供了应用程序可以运行的执行环境。因此,内核必须实现一套服务和相应的接口。应用程序使用这些接口,通常不直接与硬件资源交互。
Unix kernels provide an execution environment in which applications may run. Therefore, the kernel must implement a set of services and corresponding interfaces. Applications use those interfaces and do not usually interact directly with hardware resources.
正如已经提到的,CPU 可以运行在用户模式或内核模式。实际上,一些 CPU 可以有两个以上的执行状态。例如,80×86 微处理器有四种不同的执行状态。但所有标准 Unix 内核仅使用内核模式和用户模式。
As already mentioned, a CPU can run in either User Mode or Kernel Mode. Actually, some CPUs can have more than two execution states. For instance, the 80×86 microprocessors have four different execution states. But all standard Unix kernels use only Kernel Mode and User Mode.
当程序在用户态执行时,它不能直接访问内核数据结构或内核程序。然而,当应用程序在内核模式下执行时,这些限制就不再适用。每个 CPU 型号都提供特殊指令来从用户模式切换到内核模式,反之亦然。程序通常在用户态下执行,只有在请求内核提供的服务时才切换到内核态。当内核满足了程序的请求后,它会将程序放回用户模式。
When a program is executed in User Mode, it cannot directly access the kernel data structures or the kernel programs. When an application executes in Kernel Mode, however, these restrictions no longer apply. Each CPU model provides special instructions to switch from User Mode to Kernel Mode and vice versa. A program usually executes in User Mode and switches to Kernel Mode only when requesting a service provided by the kernel. When the kernel has satisfied the program's request, it puts the program back in User Mode.
进程是动态实体,通常在系统中具有有限的生命周期。创建、消除和同步现有流程的任务被委托给内核中的一组例程。
Processes are dynamic entities that usually have a limited life span within the system. The task of creating, eliminating, and synchronizing the existing processes is delegated to a group of routines in the kernel.
内核本身不是一个进程,而是一个进程管理器。进程/内核模型假设需要内核服务的进程使用称为系统调用的特定编程结构 。每个系统调用都会设置一组标识进程请求的参数,然后执行与硬件相关的 CPU 指令以从用户模式切换到内核模式。
The kernel itself is not a process but a process manager. The process/kernel model assumes that processes that require a kernel service use specific programming constructs called system calls . Each system call sets up the group of parameters that identifies the process request and then executes the hardware-dependent CPU instruction to switch from User Mode to Kernel Mode.
除了用户进程之外,Unix 系统还包括一些称为内核线程的特权进程 具有以下特点:
Besides user processes, Unix systems include a few privileged processes called kernel threads with the following characteristics:
它们在内核地址空间中以内核模式运行。
They run in Kernel Mode in the kernel address space.
它们不与用户交互,因此不需要终端设备。
They do not interact with users, and thus do not require terminal devices.
它们通常在系统启动期间创建并保持活动状态直到系统关闭。
They are usually created during system startup and remain alive until the system is shut down.
在单处理器系统上,一次只有一个进程运行,并且它可以在用户模式或内核模式下运行。如果它在内核模式下运行,则处理器正在执行某些内核例程。图 1-2说明了用户模式和内核模式之间转换的示例。用户模式下的进程 1 发出系统调用,之后该进程切换到内核模式,并为系统调用提供服务。然后,进程 1 在用户模式下恢复执行,直到发生定时器中断,并且调度程序在内核模式下被激活。发生进程切换,进程 2 开始在用户模式下执行,直到硬件设备发出中断。作为中断的结果,进程 2 切换到内核模式并处理中断。
On a uniprocessor system, only one process is running at a time, and it may run either in User or in Kernel Mode. If it runs in Kernel Mode, the processor is executing some kernel routine. Figure 1-2 illustrates examples of transitions between User and Kernel Mode. Process 1 in User Mode issues a system call, after which the process switches to Kernel Mode, and the system call is serviced. Process 1 then resumes execution in User Mode until a timer interrupt occurs, and the scheduler is activated in Kernel Mode. A process switch takes place, and Process 2 starts its execution in User Mode until a hardware device raises an interrupt. As a consequence of the interrupt, Process 2 switches to Kernel Mode and services the interrupt.
Unix 内核的作用远不止处理系统调用;事实上,内核例程可以通过多种方式激活:
Unix kernels do much more than handle system calls; in fact, kernel routines can be activated in several ways:
进程调用系统调用。
A process invokes a system call.
执行进程的CPU发出 异常信号,这是一种异常情况,例如无效指令。内核代表引发该异常的进程来处理该异常。
The CPU executing the process signals an exception, which is an unusual condition such as an invalid instruction. The kernel handles the exception on behalf of the process that caused it.
外围设备向 CPU 发出中断 信号,以通知其某个事件,例如请求注意、状态更改或 I/O 操作完成。每个中断信号都由称为 中断处理程序的内核程序处理。由于外围设备相对于 CPU 异步运行,因此中断会在不可预测的时间发生。
A peripheral device issues an interrupt signal to the CPU to notify it of an event such as a request for attention, a status change, or the completion of an I/O operation. Each interrupt signal is dealt with by a kernel program called an interrupt handler. Because peripheral devices operate asynchronously with respect to the CPU, interrupts occur at unpredictable times.
执行一个内核线程。因为它运行在内核模式下,所以相应的程序必须被视为内核的一部分。
A kernel thread is executed. Because it runs in Kernel Mode, the corresponding program must be considered part of the kernel.
为了让内核管理进程,每个进程都由一个进程描述符表示,其中包含有关进程当前状态的信息。
To let the kernel manage processes, each process is represented by a process descriptor that includes information about the current state of the process.
当内核停止进程的执行时,它将几个处理器寄存器的当前内容保存在进程描述符中。这些包括:
When the kernel stops the execution of a process, it saves the current contents of several processor registers in the process descriptor. These include:
程序计数器 (PC) 和堆栈指针 (SP) 寄存器
The program counter (PC) and stack pointer (SP) registers
通用寄存器
The general purpose registers
浮点寄存器
The floating point registers
处理器控制寄存器(处理器状态字)包含有关 CPU 状态的信息
The processor control registers (Processor Status Word) containing information about the CPU state
内存管理寄存器用于跟踪进程访问的 RAM
The memory management registers used to keep track of the RAM accessed by the process
当内核决定恢复执行进程时,它会使用正确的进程描述符字段来加载 CPU 寄存器。由于程序计数器的存储值指向最后执行的指令之后的指令,因此进程会从停止的位置恢复执行。
When the kernel decides to resume executing a process, it uses the proper process descriptor fields to load the CPU registers. Because the stored value of the program counter points to the instruction following the last instruction executed, the process resumes execution at the point where it was stopped.
当进程不在 CPU 上执行时,它正在等待某个事件。Unix内核区分了很多等待状态,这些状态通常是通过进程描述符的队列来实现的; 每个(可能为空)队列对应于等待特定事件的进程集。
When a process is not executing on the CPU, it is waiting for some event. Unix kernels distinguish many wait states, which are usually implemented by queues of process descriptors ; each (possibly empty) queue corresponds to the set of processes waiting for a specific event.
所有 Unix 内核都是可重入的。这意味着多个进程可能同时在内核模式下执行。当然,在单处理器系统上,只有一个进程可以进行,但许多进程在等待 CPU 或完成某些 I/O 操作时可能会在内核模式下被阻塞。例如,在代表进程向磁盘发出读取操作后,内核让磁盘控制器处理它并恢复执行其他进程。当设备满足读取时,中断会通知内核,以便之前的进程可以恢复执行。
All Unix kernels are reentrant. This means that several processes may be executing in Kernel Mode at the same time. Of course, on uniprocessor systems, only one process can progress, but many can be blocked in Kernel Mode when waiting for the CPU or the completion of some I/O operation. For instance, after issuing a read to a disk on behalf of a process, the kernel lets the disk controller handle it and resumes executing other processes. An interrupt notifies the kernel when the device has satisfied the read, so the former process can resume the execution.
提供可重入性的一种方法是编写函数,使其仅修改局部变量而不改变全局数据结构。此类函数称为可重入函数 。但可重入内核不仅限于此类可重入函数(尽管这是某些实时内核的实现方式)。相反,内核可以包含不可重入函数并使用锁定机制来确保一次只有一个进程可以执行不可重入函数。
One way to provide reentrancy is to write functions so that they modify only local variables and do not alter global data structures. Such functions are called reentrant functions . But a reentrant kernel is not limited only to such reentrant functions (although that is how some real-time kernels are implemented). Instead, the kernel can include nonreentrant functions and use locking mechanisms to ensure that only one process can execute a nonreentrant function at a time.
如果发生硬件中断,可重入内核能够挂起当前正在运行的进程,即使该进程处于内核模式也是如此。此功能非常重要,因为它提高了发出中断的设备控制器的吞吐量。一旦设备发出中断,它就会等待,直到 CPU 确认它。如果内核能够快速响应,设备控制器将能够在 CPU 处理中断的同时执行其他任务。
If a hardware interrupt occurs, a reentrant kernel is able to suspend the current running process even if that process is in Kernel Mode. This capability is very important, because it improves the throughput of the device controllers that issue interrupts. Once a device has issued an interrupt, it waits until the CPU acknowledges it. If the kernel is able to answer quickly, the device controller will be able to perform other tasks while the CPU handles the interrupt.
现在让我们看看内核重入及其对内核组织的影响。内核控制路径 表示内核为处理系统调用、异常或中断而执行的指令序列。
Now let's look at kernel reentrancy and its impact on the organization of the kernel. A kernel control path denotes the sequence of instructions executed by the kernel to handle a system call, an exception, or an interrupt.
在最简单的情况下,CPU从第一条指令到最后一条指令顺序执行内核控制路径。但是,当发生以下事件之一时,CPU 会交错内核控制路径:
In the simplest case, the CPU executes a kernel control path sequentially from the first instruction to the last. When one of the following events occurs, however, the CPU interleaves the kernel control paths :
在用户态下执行的进程调用系统调用,相应的内核控制路径验证该请求不能立即得到满足;然后它调用调度程序来选择要运行的新进程。结果,发生了进程切换。第一个内核控制路径未完成,CPU 恢复执行其他一些内核控制路径。在这种情况下,两个控制路径代表两个不同的进程执行。
A process executing in User Mode invokes a system call, and the corresponding kernel control path verifies that the request cannot be satisfied immediately; it then invokes the scheduler to select a new process to run. As a result, a process switch occurs. The first kernel control path is left unfinished, and the CPU resumes the execution of some other kernel control path. In this case, the two control paths are executed on behalf of two different processes.
CPU 在运行内核控制路径时检测到异常,例如访问 RAM 中不存在的页面。第一控制路径被挂起,CPU开始执行合适的程序。在我们的示例中,此类过程可以为进程分配一个新页面并从磁盘读取其内容。当该过程终止时,可以恢复第一控制路径。在这种情况下,两个控制路径代表同一进程执行。
The CPU detects an exception—for example, access to a page not present in RAM—while running a kernel control path. The first control path is suspended, and the CPU starts the execution of a suitable procedure. In our example, this type of procedure can allocate a new page for the process and read its contents from disk. When the procedure terminates, the first control path can be resumed. In this case, the two control paths are executed on behalf of the same process.
当 CPU 在启用中断的情况下运行内核控制路径时,会发生硬件中断。第一个内核控制路径未完成,CPU 开始处理另一个内核控制路径来处理中断。当中断处理程序终止时,第一个内核控制路径将恢复。在这种情况下,两个内核控制路径运行在同一进程的执行上下文中,总的系统CPU时间也被计算在内。然而,中断处理程序不一定代表进程进行操作。
A hardware interrupt occurs while the CPU is running a kernel control path with the interrupts enabled. The first kernel control path is left unfinished, and the CPU starts processing another kernel control path to handle the interrupt. The first kernel control path resumes when the interrupt handler terminates. In this case, the two kernel control paths run in the execution context of the same process, and the total system CPU time is accounted to it. However, the interrupt handler doesn't necessarily operate on behalf of the process.
当 CPU 在启用内核抢占的情况下运行时,会发生中断,并且可以运行更高优先级的进程。在这种情况下,第一个内核控制路径未完成,CPU 代表较高优先级进程继续执行另一个内核控制路径。仅当内核已编译为支持内核抢占时才会发生这种情况。
An interrupt occurs while the CPU is running with kernel preemption enabled, and a higher priority process is runnable. In this case, the first kernel control path is left unfinished, and the CPU resumes executing another kernel control path on behalf of the higher priority process. This occurs only if the kernel has been compiled with kernel preemption support.
图 1-3 说明了非交错和交错内核控制路径的几个示例。考虑三种不同的 CPU 状态:
Figure 1-3 illustrates a few examples of noninterleaved and interleaved kernel control paths. Three different CPU states are considered:
每个进程都在其私有地址空间中运行。在用户模式下运行的进程指的是私有堆栈、数据和代码区域。当在内核模式下运行时,进程寻址内核数据和代码区域并使用另一个私有堆栈。
Each process runs in its private address space. A process running in User Mode refers to private stack, data, and code areas. When running in Kernel Mode, the process addresses the kernel data and code areas and uses another private stack.
由于内核是可重入的,因此可以依次执行多个内核控制路径(每个路径与不同的进程相关)。在这种情况下,每个内核控制路径都引用其自己的私有内核堆栈。
Because the kernel is reentrant, several kernel control paths—each related to a different process—may be executed in turn. In this case, each kernel control path refers to its own private kernel stack.
虽然每个进程似乎都可以访问私有地址空间,但有时部分地址空间会在进程之间共享。在某些情况下,进程明确请求这种共享;在其他情况下,它是由内核自动完成的,以减少内存使用。
While it appears to each process that it has access to a private address space, there are times when part of the address space is shared among processes. In some cases, this sharing is explicitly requested by processes; in others, it is done automatically by the kernel to reduce memory usage.
如果多个用户同时需要同一个程序(例如编辑器),则该程序仅加载到内存中一次,并且其指令可以由所有需要它的用户共享。当然,它的数据不能共享,因为每个用户都会有单独的数据。这种共享地址空间是由内核自动完成的,以节省内存。
If the same program, say an editor, is needed simultaneously by several users, the program is loaded into memory only once, and its instructions can be shared by all of the users who need it. Its data, of course, must not be shared, because each user will have separate data. This kind of shared address space is done automatically by the kernel to save memory.
进程还可以使用 System V 中引入的“共享内存”技术来共享部分地址空间,作为一种进程间通信并受Linux支持。
Processes also can share parts of their address space as a kind of interprocess communication, using the "shared memory" technique introduced in System V and supported by Linux.
最后,Linux 支持 mmap( ) 系统调用,它允许将文件的一部分或存储在块设备上的信息映射到进程地址空间的一部分。内存映射可以为传输数据提供正常读写之外的替代方案。如果同一个文件被多个进程共享,则其内存映射包含在共享它的每个进程的地址空间中。
Finally, Linux supports the mmap(
) system call, which allows part of a file or the
information stored on a block device to be mapped into a part of a
process address space. Memory mapping can provide an alternative to
normal reads and writes for transferring data. If the same file is
shared by several processes, its memory mapping is included in the
address space of each of the processes that share it.
实现可重入内核需要使用同步。如果内核控制路径在作用于内核数据结构时被挂起,则不应允许其他内核控制路径作用于同一数据结构,除非它已重置为一致状态。否则,两个控制路径的交互可能会破坏存储的信息。
Implementing a reentrant kernel requires the use of synchronization . If a kernel control path is suspended while acting on a kernel data structure, no other kernel control path should be allowed to act on the same data structure unless it has been reset to a consistent state. Otherwise, the interaction of the two control paths could corrupt the stored information.
例如,假设全局变量 V 包含某些系统资源的可用项数。第一个内核控制路径 A 读取变量并确定只有一个可用项。此时,另一个内核控制路径 B 被激活并读取相同的变量,该变量仍然包含值 1。因此,B 减少 V 并开始使用资源项。然后A继续执行;因为它已经读取了 V 的值,所以它假设它可以减少 V 并获取 B 已经使用的资源项。最终结果是,V 包含 -1,并且两个内核控制路径使用相同的资源项,可能会产生灾难性的影响。
For example, suppose a global variable V contains the number of available items of some system resource. The first kernel control path, A, reads the variable and determines that there is just one available item. At this point, another kernel control path, B, is activated and reads the same variable, which still contains the value 1. Thus, B decreases V and starts using the resource item. Then A resumes the execution; because it has already read the value of V, it assumes that it can decrease V and take the resource item, which B already uses. As a final result, V contains -1, and two kernel control paths use the same resource item with potentially disastrous effects.
当计算结果取决于两个或多个进程的调度方式时,代码是不正确的。我们说存在 竞争条件。
When the outcome of a computation depends on how two or more processes are scheduled, the code is incorrect. We say that there is a race condition.
一般来说,对全局变量的安全访问是通过使用原子操作来确保的。在前面的示例中,如果两个控制路径通过单个不可中断的操作读取并减小 V,则不可能出现数据损坏。然而,内核包含许多无法通过单个操作访问的数据结构。例如,通常不可能通过单个操作从链表中删除元素,因为内核需要同时访问至少两个指针。任何这样的代码段——每个开始执行它的进程都必须在其他进程能够进入之前完成它——称为临界区。[*]
In general, safe access to a global variable is ensured by using atomic operations . In the previous example, data corruption is not possible if the two control paths read and decrease V with a single, noninterruptible operation. However, kernels contain many data structures that cannot be accessed with a single operation. For example, it usually isn't possible to remove an element from a linked list with a single operation, because the kernel needs to access at least two pointers at once. Any section of code that should be finished by each process that begins it before another process can enter it is called a critical region.[*]
这些问题不仅发生在内核控制路径之间,而且还发生在共享公共数据的进程之间。已采用多种同步技术。以下部分重点介绍如何同步内核控制路径。
These problems occur not only among kernel control paths but also among processes sharing common data. Several synchronization techniques have been adopted. The following section concentrates on how to synchronize kernel control paths.
为了给同步问题提供一个极其简单的解决方案,一些传统的 Unix 内核是非抢占式的:当一个进程在内核模式下执行时,它不能被任意挂起并被另一个进程替换。因此,在单处理器系统上,内核可以安全地访问所有不被中断或异常处理程序更新的内核数据结构。
To provide a drastically simple solution to synchronization problems, some traditional Unix kernels are nonpreemptive: when a process executes in Kernel Mode, it cannot be arbitrarily suspended and substituted with another process. Therefore, on a uniprocessor system, all kernel data structures that are not updated by interrupts or exception handlers are safe for the kernel to access.
当然,内核态的进程可以主动放弃CPU,但在这种情况下,它必须确保所有数据结构保持一致的状态。此外,当它恢复执行时,它必须重新检查任何先前访问的可能更改的数据结构的值。
Of course, a process in Kernel Mode can voluntarily relinquish the CPU, but in this case, it must ensure that all data structures are left in a consistent state. Moreover, when it resumes its execution, it must recheck the value of any previously accessed data structures that could be changed.
适用于抢占式内核的一种同步机制是:在进入临界区之前禁用内核抢占,并在离开该区域后立即重新启用它。
A synchronization mechanism applicable to preemptive kernels consists of disabling kernel preemption before entering a critical region and reenabling it right after leaving the region.
对于多处理器系统来说,不可抢占性是不够的,因为运行在不同CPU上的两个内核控制路径可以同时访问相同的数据结构。
Nonpreemptability is not enough for multiprocessor systems, because two kernel control paths running on different CPUs can concurrently access the same data structure.
单处理器系统的另一种同步机制包括在进入关键区域之前禁用所有硬件中断,并在离开关键区域后立即重新启用它们。这种机制虽然简单,但远非最佳。如果临界区域很大,中断可能会在相对较长的时间内保持禁用状态,从而可能导致所有硬件活动冻结。
Another synchronization mechanism for uniprocessor systems consists of disabling all hardware interrupts before entering a critical region and reenabling them right after leaving it. This mechanism, while simple, is far from optimal. If the critical region is large, interrupts can remain disabled for a relatively long time, potentially causing all hardware activities to freeze.
此外,在多处理器系统上,禁用本地CPU上的中断是不够的,必须使用其他同步技术。
Moreover, on a multiprocessor system, disabling interrupts on the local CPU is not sufficient, and other synchronization techniques must be used.
一种广泛使用的机制,在单处理器和多处理器系统中都有效,依赖于信号量的使用 。信号量只是与数据结构关联的计数器;所有内核线程在尝试访问数据结构之前都会检查它。每个信号量可以被视为一个由以下部分组成的对象:
A widely used mechanism, effective in both uniprocessor and multiprocessor systems, relies on the use of semaphores . A semaphore is simply a counter associated with a data structure; it is checked by all kernel threads before they try to access the data structure. Each semaphore may be viewed as an object composed of:
整型变量
An integer variable
等待进程列表
A list of waiting processes
两种原子方法:down( ) 和 up( )
Two atomic methods: down( ) and up( )
down( ) 方法减小信号量的值。如果新值小于 0,则该方法将正在运行的进程添加到信号量列表中,然后阻塞(即调用调度程序)。up( ) 方法增加信号量的值,如果其新值大于或等于 0,则重新激活信号量列表中的一个或多个进程。
The down( ) method
decreases the value of the semaphore. If the new value is less than
0, the method adds the running process to the semaphore list and
then blocks (i.e., invokes the scheduler). The up( ) method increases the value of the
semaphore and, if its new value is greater than or equal to 0,
reactivates one or more processes in the semaphore list.
每个要保护的数据结构都有自己的信号量,信号量被初始化为 1。当内核控制路径希望访问该数据结构时,它会在适当的信号量上执行 down( ) 方法。如果信号量的新值不为负,则授予对数据结构的访问权限。否则,正在执行内核控制路径的进程将被添加到信号量列表中并被阻塞。当另一个进程在该信号量上执行 up( ) 方法时,信号量列表中的进程之一被允许继续执行。
Each data structure to be protected has its own semaphore, which is initialized to 1. When a kernel control path wishes to access the data structure, it executes the down( ) method on the proper semaphore. If the new value of the semaphore isn't negative, access to the data structure is granted. Otherwise, the process that is executing the kernel control path is added to the semaphore list and blocked. When another process executes the up( ) method on that semaphore, one of the processes in the semaphore list is allowed to proceed.
在多处理器系统中,信号量并不总是同步问题的最佳解决方案。应保护某些内核数据结构,防止运行在不同 CPU 上的内核控制路径同时访问。在这种情况下,如果更新数据结构所需的时间很短,则信号量的效率可能非常低。要检查信号量,内核必须在信号量列表中插入一个进程,然后挂起它。由于这两个操作都相对昂贵,因此在完成它们所需的时间内,另一个内核控制路径可能已经释放了信号量。
In multiprocessor systems, semaphores are not always the best solution to the synchronization problems. Some kernel data structures should be protected from being concurrently accessed by kernel control paths that run on different CPUs. In this case, if the time required to update the data structure is short, a semaphore could be very inefficient. To check a semaphore, the kernel must insert a process in the semaphore list and then suspend it. Because both operations are relatively expensive, in the time it takes to complete them, the other kernel control path could have already released the semaphore.
在这些情况下,多处理器操作系统使用 自旋锁 。自旋锁与信号量非常相似,但它没有进程列表;当一个进程发现另一个进程关闭了锁时,它会反复“旋转”,执行紧密的指令循环,直到锁打开。
In these cases, multiprocessor operating systems use spin locks . A spin lock is very similar to a semaphore, but it has no process list; when a process finds the lock closed by another process, it "spins" around repeatedly, executing a tight instruction loop until the lock becomes open.
当然,自旋锁在单处理器环境中是没有用的。当内核控制路径尝试访问锁定的数据结构时,它会启动无限循环。因此,正在更新受保护数据结构的内核控制路径将没有机会继续执行并释放自旋锁。最终的结果就是系统挂起。
Of course, spin locks are useless in a uniprocessor environment. When a kernel control path tries to access a locked data structure, it starts an endless loop. Therefore, the kernel control path that is updating the protected data structure would not have a chance to continue the execution and release the spin lock. The final result would be that the system hangs.
与其他控制路径同步的进程或内核控制路径很容易进入 死锁状态。最简单的死锁情况发生在进程p1获得对数据结构a 的访问权并且进程p2 获得对b 的访问权,但p1 等待b且p2 等待a时。进程组之间还可能发生其他更复杂的循环等待。当然,死锁情况会导致受影响的进程或内核控制路径完全冻结。
Processes or kernel control paths that synchronize with other control paths may easily enter a deadlock state. The simplest case of deadlock occurs when process p1 gains access to data structure a and process p2 gains access to b, but p1 then waits for b and p2 waits for a. Other more complex cyclic waits among groups of processes also may occur. Of course, a deadlock condition causes a complete freeze of the affected processes or kernel control paths.
就内核设计而言,当使用的内核锁数量较多时,死锁就会成为一个问题。在这种情况下,可能很难确保交错内核控制路径的所有可能方式都不会达到死锁状态。包括 Linux 在内的多种操作系统通过按预定义的顺序请求锁定来避免此问题。
As far as kernel design is concerned, deadlocks become an issue when the number of kernel locks used is high. In this case, it may be quite difficult to ensure that no deadlock state will ever be reached for all possible ways to interleave kernel control paths. Several operating systems, including Linux, avoid this problem by requesting locks in a predefined order.
Unix 信号提供一种向进程通知系统事件的机制。每个事件都有自己的信号编号,通常由符号常量(例如 SIGTERM)来引用。系统事件有两种:
Unix signals provide a mechanism for notifying processes of system
events. Each event has its own signal number, which is usually
referred to by a symbolic constant such as SIGTERM. There are two kinds of system
events:
Asynchronous notifications
For instance, a user can send the interrupt signal
SIGINT to a foreground
process by pressing the interrupt keycode (usually Ctrl-C) at
the terminal.
Synchronous notifications
For instance, the kernel sends the signal SIGSEGV to a process when it accesses
a memory location at an invalid address.
The POSIX standard defines about 20 different signals, 2 of which are user-definable and may be used as a primitive mechanism for communication and synchronization among processes in User Mode. In general, a process may react to a signal delivery in two possible ways:
Ignore the signal.
Asynchronously execute a specified procedure (the signal handler).
If the process does not specify one of these alternatives, the kernel performs a default action that depends on the signal number. The five possible default actions are:
Terminate the process.
Write the execution context and the contents of the address space in a file (core dump) and terminate the process.
Ignore the signal.
Suspend the process.
Resume the process's execution, if it was stopped.
Kernel signal handling is rather elaborate, because the POSIX
semantics allows processes to temporarily block signals. Moreover, the
SIGKILL and SIGSTOP signals cannot be directly handled
by the process or ignored.
AT&T's Unix System V introduced other kinds of interprocess communication among processes in User Mode, which have been adopted by many Unix kernels: semaphores, message queues, and shared memory. They are collectively known as System V IPC.
The kernel implements these constructs as IPC
resources. A process acquires a resource by invoking a
shmget( ), semget( ),
or msgget( )
system call. Just like files, IPC resources are
persistent: they must be explicitly deallocated by the creator
process, by the current owner, or by a superuser process.
Semaphores are similar to those described in the section "Synchronization and Critical
Regions," earlier in this chapter, except that they are
reserved for processes in User Mode. Message queues allow processes to
exchange messages by using the msgsnd( )
and msgrcv( )
system calls, which insert a message into a specific
message queue and extract a message from it, respectively.
The POSIX standard (IEEE Std 1003.1-2001) defines an IPC mechanism based on message queues, which is usually known as POSIX message queues . They are similar to the System V IPC's message queues, but they have a much simpler file-based interface to the applications.
Shared memory provides the fastest way for processes to exchange
and share data. A process starts by issuing a shmget( ) system call to create a new shared
memory having a required size. After obtaining the IPC resource
identifier, the process invokes the shmat( )
system call, which returns the starting address of the
new region within the process address space. When the process wishes
to detach the shared memory from its address space, it invokes the
shmdt( ) system call. The implementation of shared memory
depends on how the kernel implements process address spaces.
Unix makes a neat distinction between the process and
the program it is executing. To that end, the fork( ) and _exit( )
system calls are used respectively to create a new
process and to terminate it, while an exec( )-like
system call is invoked to load a new program. After
such a system call is executed, the process resumes execution with a
brand new address space containing the loaded program.
The process that invokes a fork( )
is the parent, while the new process
is its child. Parents and children can find one
another because the data structure describing each process includes a
pointer to its immediate parent and pointers to all its immediate
children.
A naive implementation of fork( )
would require both the parent's data and the parent's code
to be duplicated and the copies assigned to the child. This would be
quite time consuming. Current kernels that can rely on hardware paging
units follow the Copy-On-Write approach, which defers page duplication
until the last moment (i.e., until the parent or the child is required
to write into a page). We shall describe how Linux implements this
technique in the section "Copy On Write" in Chapter 9.
The _exit( ) system call
terminates a process. The kernel handles this system call by releasing
the resources owned by the process and sending the parent process a
SIGCHLD signal, which is ignored by
default.
How can a parent process inquire about termination of its
children? The wait4( ) system call allows a process to wait until one of its
children terminates; it returns the process ID (PID) of the
terminated child.
When executing this system call, the kernel checks whether a
child has already terminated. A special zombie
process state is introduced to represent terminated processes: a
process remains in that state until its parent process executes a
wait4( ) system call on it. The
system call handler extracts data about resource usage from the
process descriptor fields; the process descriptor may be released
once the data is collected. If no child process has already
terminated when the wait4( )
system call is executed, the kernel usually puts the process in a
wait state until a child terminates.
Many kernels also implement a waitpid( ) system call, which allows a process to wait for a
specific child process. Other variants of wait4( ) system calls are also quite
common.
It's good practice for the kernel to keep around information
on a child process until the parent issues its wait4( ) call, but suppose the parent
process terminates without issuing that call? The information takes
up valuable memory slots that could be used to serve living
processes. For example, many shells allow the user to start a
command in the background and then log out. The process that is
running the command shell terminates, but its children continue
their execution.
The solution lies in a special system process called
init, which is created during system
initialization. When a process terminates, the kernel changes the
appropriate process descriptor pointers of all the existing children
of the terminated process to make them become children of
init. This process monitors the execution of
all its children and routinely issues wait4( ) system calls, whose side effect
is to get rid of all orphaned zombies.
Modern Unix operating systems introduce the notion of process groups to represent a "job" abstraction. For example, in order to execute the command line:
$ ls | sort | more
a shell that supports process groups, such as bash, creates a new group for the three
processes corresponding to ls,
sort, and more. In this way, the shell acts on the
three processes as if they were a single entity (the job, to be
precise). Each process descriptor includes a field containing the
process group ID . Each group of processes may have a group
leader, which is the process whose PID coincides with the
process group ID. A newly created process is initially inserted into
the process group of its parent.
Modern Unix kernels also introduce login
sessions. Informally, a login session contains all
processes that are descendants of the process that has started a
working session on a specific terminal—usually, the first command
shell process created for the user. All processes in a process group
must be in the same login session. A login session may have several
process groups active simultaneously; one of these process groups is
always in the foreground, which means that it has access to the
terminal. The other active process groups are in the background.
When a background process tries to access the terminal, it receives
a SIGTTIN or SIGTTOUT signal. In many command shells,
the internal commands bg and
fg can be used to put a process
group in either the background or the foreground.
Memory management is by far the most complex activity in a Unix kernel. More than a third of this book is dedicated just to describing how Linux handles memory management. This section illustrates some of the main issues related to memory management.
All recent Unix systems provide a useful abstraction called virtual memory. Virtual memory acts as a logical layer between the application memory requests and the hardware Memory Management Unit (MMU). Virtual memory has many purposes and advantages:
Several processes can be executed concurrently.
It is possible to run applications whose memory needs are larger than the available physical memory.
Processes can execute a program whose code is only partially loaded in memory.
Each process is allowed to access a subset of the available physical memory.
Processes can share a single memory image of a library or program.
Programs can be relocatable — that is, they can be placed anywhere in physical memory.
Programmers can write machine-independent code, because they do not need to be concerned about physical memory organization.
The main ingredient of a virtual memory subsystem is the notion of virtual address space. The set of memory references that a process can use is different from physical memory addresses. When a process uses a virtual address,[*] the kernel and the MMU cooperate to find the actual physical location of the requested memory item.
Today's CPUs include hardware circuits that automatically translate the virtual addresses into physical ones. To that end, the available RAM is partitioned into page frames —typically 4 or 8 KB in length—and a set of Page Tables is introduced to specify how virtual addresses correspond to physical addresses. These circuits make memory allocation simpler, because a request for a block of contiguous virtual addresses can be satisfied by allocating a group of page frames having noncontiguous physical addresses.
All Unix operating systems clearly distinguish between two portions of the random access memory (RAM). A few megabytes are dedicated to storing the kernel image (i.e., the kernel code and the kernel static data structures). The remaining portion of RAM is usually handled by the virtual memory system and is used in three possible ways:
To satisfy kernel requests for buffers, descriptors, and other dynamic kernel data structures
To satisfy process requests for generic memory areas and for memory mapping of files
To get better performance from disks and other buffered devices by means of caches
Each request type is valuable. On the other hand, because the available RAM is limited, some balancing among request types must be done, particularly when little available memory is left. Moreover, when some critical threshold of available memory is reached and a page-frame-reclaiming algorithm is invoked to free additional memory, which are the page frames most suitable for reclaiming? As we will see in Chapter 17, there is no simple answer to this question and very little support from theory. The only available solution lies in developing carefully tuned empirical algorithms.
One major problem that must be solved by the virtual memory system is memory fragmentation . Ideally, a memory request should fail only when the number of free page frames is too small. However, the kernel is often forced to use physically contiguous memory areas. Hence the memory request could fail even if there is enough memory available, but it is not available as one contiguous chunk.
The Kernel Memory Allocator (KMA) is a subsystem that tries to satisfy the requests for memory areas from all parts of the system. Some of these requests come from other kernel subsystems needing memory for kernel use, and some requests come via system calls from user programs to increase their processes' address spaces. A good KMA should have the following features:
It must be fast. Actually, this is the most crucial attribute, because it is invoked by all kernel subsystems (including the interrupt handlers).
It should minimize the amount of wasted memory.
It should try to reduce the memory fragmentation problem.
It should be able to cooperate with the other memory management subsystems to borrow and release page frames from them.
Several proposed KMAs, which are based on a variety of different algorithmic techniques, include:
As we will see in Chapter 8, Linux's KMA uses a Slab allocator on top of a buddy system.
The address space of a process contains all the virtual memory
addresses that the process is allowed to reference. The kernel
usually stores a process virtual address space as a list of
memory area descriptors. For example, when a process starts the execution of
some program via an exec( )-like
system call, the kernel assigns to the process a virtual address
space that comprises memory areas for:
The executable code of the program
The initialized data of the program
The uninitialized data of the program
The initial program stack (i.e., the User Mode stack)
The executable code and data of needed shared libraries
The heap (the memory dynamically requested by the program)
All recent Unix operating systems adopt a memory allocation
strategy called demand paging. With demand paging, a process can start program
execution with none of its pages in physical memory. As it accesses
a nonpresent page, the MMU generates an exception; the exception
handler finds the affected memory region, allocates a free page, and
initializes it with the appropriate data. In a similar fashion, when
the process dynamically requires memory by using malloc( ), or the brk( ) system call (which is invoked internally by malloc( )), the kernel just updates the
size of the heap memory region of the process. A page frame is
assigned to the process only when it generates an exception by
trying to reference its virtual memory addresses.
Virtual address spaces also allow other efficient strategies, such as the Copy On Write strategy mentioned earlier. For example, when a new process is created, the kernel just assigns the parent's page frames to the child address space, but marks them read-only. An exception is raised as soon as the parent or the child tries to modify the contents of a page. The exception handler assigns a new page frame to the affected process and initializes it with the contents of the original page.
A good part of the available physical memory is used as cache for hard disks and other block devices. This is because hard drives are very slow: a disk access requires several milliseconds, which is a very long time compared with the RAM access time. Therefore, disks are often the bottleneck in system performance. As a general rule, one of the policies already implemented in the earliest Unix system is to defer writing to disk as long as possible. As a result, data read previously from disk and no longer used by any process continue to stay in RAM.
This strategy is based on the fact that there is a good chance that new processes will require data read from or written to disk by processes that no longer exist. When a process asks to access a disk, the kernel checks first whether the required data are in the cache. Each time this happens (a cache hit), the kernel is able to service the process request without accessing the disk.
The sync( ) system call forces disk synchronization by writing
all of the "dirty" buffers (i.e., all the buffers whose contents
differ from that of the corresponding disk blocks) into disk. To
avoid data loss, all operating systems take care to periodically
write dirty buffers back to disk.
The kernel interacts with I/O devices by means of device drivers. Device drivers are included in the kernel and consist of data structures and functions that control one or more devices, such as hard disks, keyboards, mice, monitors, network interfaces, and devices connected to a SCSI bus. Each driver interacts with the remaining part of the kernel (even with other drivers) through a specific interface. This approach has the following advantages:
Device-specific code can be encapsulated in a specific module.
Vendors can add new devices without knowing the kernel source code; only the interface specifications must be known.
The kernel deals with all devices in a uniform way and accesses them through the same interface.
It is possible to write a device driver as a module that can be dynamically loaded in the kernel without requiring the system to be rebooted. It is also possible to dynamically unload a module that is no longer needed, therefore minimizing the size of the kernel image stored in RAM.
Figure 1-4 illustrates how device drivers interface with the rest of the kernel and with the processes.
Some user programs (P) wish to operate on hardware devices. They make requests to the kernel using the usual file-related system calls and the device files normally found in the /dev directory. Actually, the device files are the user-visible portion of the device driver interface. Each device file refers to a specific device driver, which is invoked by the kernel to perform the requested operation on the hardware component.
At the time Unix was introduced, graphical terminals were uncommon and expensive, so only alphanumeric terminals were handled directly by Unix kernels. When graphical terminals became widespread, ad hoc applications such as the X Window System were introduced that ran as standard processes and accessed the I/O ports of the graphics interface and the RAM video area directly. Some recent Unix kernels, such as Linux 2.6, provide an abstraction for the frame buffer of the graphic card and allow application software to access them without needing to know anything about the I/O ports of the graphics interface (see the section "Levels of Kernel Support" in Chapter 13).
[*] Synchronization problems have been fully described in other works; we refer the interested reader to books on the Unix operating systems (see the Bibliography).
This chapter deals with addressing techniques. Luckily, an operating system is not forced to keep track of physical memory all by itself; today's microprocessors include several hardware circuits to make memory management both more efficient and more robust so that programming errors cannot cause improper accesses to memory outside the program.
As in the rest of this book, we offer details in this chapter on how 80 × 86 microprocessors address memory chips and how Linux uses the available addressing circuits. You will find, we hope, that when you learn the implementation details on Linux's most popular platform you will better understand both the general theory of paging and how to research the implementation on other platforms.
This is the first of three chapters related to memory management; Chapter 8 discusses how the kernel allocates main memory to itself, while Chapter 9 considers how linear addresses are assigned to processes.
Programmers casually refer to a memory address as the way to access the contents of a memory cell. But when dealing with 80 × 86 microprocessors, we have to distinguish three kinds of addresses:
Logical address
Included in the machine language instructions to specify the address of an operand or of an instruction. This type of address embodies the well-known 80 × 86 segmented architecture that forces MS-DOS and Windows programmers to divide their programs into segments. Each logical address consists of a segment and an offset (or displacement) that denotes the distance from the start of the segment to the actual address.
Linear address (also known as virtual address)
A single 32-bit unsigned integer that can be used to address
up to 4 GB — that is, up to 4,294,967,296 memory cells. Linear
addresses are usually represented in hexadecimal notation; their
values range from 0x00000000 to
0xffffffff.
Physical address
Used to address memory cells in memory chips. They correspond to the electrical signals sent along the address pins of the microprocessor to the memory bus. Physical addresses are represented as 32-bit or 36-bit unsigned integers.
The Memory Management Unit (MMU) transforms a logical address into a linear address by means of a hardware circuit called a segmentation unit ; subsequently, a second hardware circuit called a paging unit transforms the linear address into a physical address (see Figure 2-1).
In multiprocessor systems, all CPUs usually share the same memory; this means that RAM chips may be accessed concurrently by independent CPUs. Because read or write operations on a RAM chip must be performed serially, a hardware circuit called a memory arbiter is inserted between the bus and every RAM chip. Its role is to grant access to a CPU if the chip is free and to delay it if the chip is busy servicing a request by another processor. Even uniprocessor systems use memory arbiters, because they include specialized processors called DMA controllers that operate concurrently with the CPU (see the section "Direct Memory Access (DMA)" in Chapter 13). In the case of multiprocessor systems, the structure of the arbiter is more complex because it has more input ports. The dual Pentium, for instance, maintains a two-port arbiter at each chip entrance and requires that the two CPUs exchange synchronization messages before attempting to use the common bus. From the programming point of view, the arbiter is hidden because it is managed by hardware circuits.
Starting with the 80286 model, Intel microprocessors perform address translation in two different ways called real mode and protected mode . We'll focus in the next sections on address translation when protected mode is enabled. Real mode exists mostly to maintain processor compatibility with older models and to allow the operating system to bootstrap (see Appendix A for a short description of real mode).
A logical address consists of two parts: a segment identifier and an offset that specifies the relative address within the segment. The segment identifier is a 16-bit field called the Segment Selector (see Figure 2-2), while the offset is a 32-bit field. We'll describe the fields of Segment Selectors in the section "Fast Access to Segment Descriptors" later in this chapter.
To make it easy to retrieve segment selectors quickly, the processor provides segmentation
registers whose only purpose is to hold Segment Selectors; these
registers are called cs, ss, ds,
es, fs, and gs. Although there are only six of them, a
program can reuse the same segmentation register for different
purposes by saving its content in memory and then restoring it
later.
Three of the six segmentation registers have specific purposes:
cs
The code segment register, which points to a segment containing program instructions
ss
The stack segment register, which points to a segment containing the current program stack
ds
The data segment register, which points to a segment containing global and static data
The remaining three segmentation registers are general purpose and may refer to arbitrary data segments.
The cs register has another
important function: it includes a 2-bit field that specifies the
Current Privilege Level (CPL) of the CPU. The value 0 denotes the
highest privilege level, while the value 3 denotes the lowest one.
Linux uses only levels 0 and 3, which are respectively called Kernel
Mode and User Mode.
Each segment is represented by an 8-byte Segment Descriptor that describes the segment characteristics. Segment Descriptors are stored either in the Global Descriptor Table (GDT) or in the Local Descriptor Table (LDT).
Usually only one GDT is defined, while each process is permitted
to have its own LDT if it needs to create additional segments besides
those stored in the GDT. The address and size of the GDT in main
memory are contained in the gdtr
control register, while the address and size of the
currently used LDT are contained in the ldtr control register.
Figure 2-3 illustrates the format of a Segment Descriptor; the meaning of the various fields is explained in Table 2-1.
Table 2-1. Segment Descriptor fields
There are several types of segments, and thus several types of Segment Descriptors. The following list shows the types that are widely used in Linux.
Code Segment Descriptor
Indicates that the Segment Descriptor refers to a code
segment; it may be included either in the GDT or in the LDT. The
descriptor has the S flag set
(non-system segment).
Data Segment Descriptor
Indicates that the Segment Descriptor refers to a data
segment; it may be included either in the GDT or in the LDT. The
descriptor has the S flag
set. Stack segments are implemented by means of generic data
segments.
Task State Segment Descriptor (TSSD)
Indicates that the Segment Descriptor refers to a Task
State Segment (TSS) — that is, a segment used to save the
contents of the processor registers (see the section "Task State Segment"
in Chapter 3); it can
appear only in the GDT. The corresponding Type field has the value 11 or 9,
depending on whether the corresponding process is currently
executing on a CPU. The S
flag of such descriptors is set to 0.
Indicates that the Segment Descriptor refers to a segment
containing an LDT; it can appear only in the GDT. The
corresponding Type field has
the value 2. The S flag of
such descriptors is set to 0. The next section shows how 80 × 86
processors are able to decide whether a segment descriptor is
stored in the GDT or in the LDT of the process.
We recall that logical addresses consist of a 16-bit Segment Selector and a 32-bit Offset, and that segmentation registers store only the Segment Selector.
To speed up the translation of logical addresses into linear addresses, the 80 × 86 processor provides an additional nonprogrammable register—that is, a register that cannot be set by a programmer—for each of the six programmable segmentation registers. Each nonprogrammable register contains the 8-byte Segment Descriptor (described in the previous section) specified by the Segment Selector contained in the corresponding segmentation register. Every time a Segment Selector is loaded in a segmentation register, the corresponding Segment Descriptor is loaded from memory into the matching nonprogrammable CPU register. From then on, translations of logical addresses referring to that segment can be performed without accessing the GDT or LDT stored in main memory; the processor can refer directly to the CPU register containing the Segment Descriptor. Accesses to the GDT or LDT are necessary only when the contents of the segmentation registers change (see Figure 2-4).
Any Segment Selector includes three fields that are described in Table 2-2.
Table 2-2. Segment Selector fields
Because a Segment Descriptor is 8 bytes long, its relative
address inside the GDT or the LDT is obtained by multiplying the
13-bit index field of the Segment Selector by 8. For instance, if the
GDT is at 0x00020000 (the value
stored in the gdtr register) and
the index specified by the Segment Selector is 2, the address of the
corresponding Segment Descriptor is 0x00020000 + (2 × 8),
or 0x00020010.
The first entry of the GDT is always set to 0. This ensures that logical addresses with a null Segment Selector will be considered invalid, thus causing a processor exception. The maximum number of Segment Descriptors that can be stored in the GDT is 8,191 (i.e., 2¹³ − 1).
Figure 2-5 shows in detail how a logical address is translated into a corresponding linear address. The segmentation unit performs the following operations:
Examines the TI field of
the Segment Selector to determine which Descriptor Table stores
the Segment Descriptor. This field indicates that the Descriptor
is either in the GDT (in which case the segmentation unit gets the
base linear address of the GDT from the gdtr register) or in the active LDT (in
which case the segmentation unit gets the base linear address of
that LDT from the ldtr
register).
Computes the address of the Segment Descriptor from the
index field of the Segment
Selector. The index field is
multiplied by 8 (the size of a Segment Descriptor), and the result
is added to the content of the gdtr or ldtr register.
Adds the offset of the logical address to the Base field of the Segment Descriptor,
thus obtaining the linear address.
Notice that, thanks to the nonprogrammable registers associated with the segmentation registers, the first two operations need to be performed only when a segmentation register has been changed.
Segmentation has been included in 80 × 86 microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant, because both can be used to separate the physical address spaces of processes: segmentation can assign a different linear address space to each process, while paging can map the same linear address space into different physical address spaces. Linux prefers paging to segmentation for the following reasons:
Memory management is simpler when all processes use the same segment register values — that is, when they share the same set of linear addresses.
One of the design objectives of Linux is portability to a wide range of architectures; RISC architectures in particular have limited support for segmentation.
The 2.6 version of Linux uses segmentation only when required by the 80 × 86 architecture.
All Linux processes running in User Mode use the same pair of segments to address instructions and data. These segments are called user code segment and user data segment , respectively. Similarly, all Linux processes running in Kernel Mode use the same pair of segments to address instructions and data: they are called kernel code segment and kernel data segment , respectively. Table 2-3 shows the values of the Segment Descriptor fields for these four crucial segments.
Table 2-3. Values of the Segment Descriptor fields for the four main Linux segments
| Segment | Base | G | Limit | S | Type | DPL | D/B | P |
|---|---|---|---|---|---|---|---|---|
| user code | 0x00000000 | 1 | 0xfffff | 1 | 10 | 3 | 1 | 1 |
| user data | 0x00000000 | 1 | 0xfffff | 1 | 2 | 3 | 1 | 1 |
| kernel code | 0x00000000 | 1 | 0xfffff | 1 | 10 | 0 | 1 | 1 |
| kernel data | 0x00000000 | 1 | 0xfffff | 1 | 2 | 0 | 1 | 1 |
The corresponding Segment Selectors are defined by the macros __USER_CS, __USER_DS, __KERNEL_CS, and __KERNEL_DS, respectively. To address the kernel code segment, for instance, the kernel just loads the value yielded by the __KERNEL_CS macro into the cs segmentation register.
Notice that the linear addresses associated with such segments all start at 0 and reach the addressing limit of 2³² − 1. This means that all processes, either in User Mode or in Kernel Mode, may use the same logical addresses.
Another important consequence of having all segments start at
0x00000000 is that in Linux, logical
addresses coincide with linear addresses; that is, the value of the
Offset field of a logical address always coincides with the value of the
corresponding linear address.
As stated earlier, the Current Privilege Level of the CPU
indicates whether the processor is in User or Kernel Mode and is
specified by the RPL field of the
Segment Selector stored in the cs
register. Whenever the CPL is changed, some segmentation registers must
be correspondingly updated. For instance, when the CPL is equal to 3 (User Mode), the ds register must contain the Segment Selector
of the user data segment, but when the CPL is equal to 0, the ds register must contain the Segment Selector
of the kernel data segment.
A similar situation occurs for the ss register. It must refer to a User Mode
stack inside the user data segment when the CPL is 3, and it must refer
to a Kernel Mode stack inside the kernel data segment when the CPL is 0.
When switching from User Mode to Kernel Mode, Linux always makes sure
that the ss register contains the
Segment Selector of the kernel data segment.
When saving a pointer to an instruction or to a data structure, the kernel does not need to store the Segment Selector component of the logical address, because the segmentation registers already hold the current Segment Selectors. As an example, when the
kernel invokes a function, it executes a call assembly language instruction specifying just the Offset
component of its logical address; the Segment Selector is implicitly
selected as the one referred to by the cs register. Because there is just one segment
of type "executable in Kernel Mode," namely the code segment identified
by __KERNEL_CS, it is sufficient to
load __KERNEL_CS into cs whenever the CPU switches to Kernel Mode.
The same argument goes for pointers to kernel data structures
(implicitly using the ds register),
as well as for pointers to user data structures (the kernel explicitly
uses the es register).
Besides the four segments just described, Linux makes use of a few other specialized segments. We'll introduce them in the next section while describing the Linux GDT.
In uniprocessor systems there is only one GDT, while in
multiprocessor systems there is one GDT for every CPU in the system.
All GDTs are stored in the cpu_gdt_table array, while the addresses and
sizes of the GDTs (used when initializing the gdtr registers) are stored in the cpu_gdt_descr array. If you look in the
Source Code Index, you can see that these symbols are defined in the
file arch/i386/kernel/head.S
. Every macro, function, and other symbol in this book
is listed in the Source Code Index, so you can quickly find it in the
source code.
The layout of the GDTs is shown schematically in Figure 2-6. Each GDT includes 18 segment descriptors and 14 null, unused, or reserved entries. Unused entries are inserted on purpose so that Segment Descriptors usually accessed together are kept in the same 32-byte line of the hardware cache (see the section "Hardware Cache" later in this chapter).
The 18 segment descriptors included in each GDT point to the following segments:
Four user and kernel code and data segments (see previous section).
A Task State Segment (TSS), different for each processor in
the system. The linear address space corresponding to a TSS is a
small subset of the linear address space corresponding to the
kernel data segment. The Task State Segments are sequentially
stored in the init_tss array;
in particular, the Base field
of the TSS descriptor for the n
th CPU points to the
n th component of
the init_tss array. The
G (granularity) flag is
cleared, while the Limit field
is set to 0xeb, because the TSS
segment is 236 bytes long. The Type field is set to 9 or 11 (available
32-bit TSS), and the DPL is set
to 0, because processes in User Mode are not allowed to access TSS
segments. You will find details on how Linux uses TSSs in the
section "Task State
Segment" in Chapter
3.
A segment including the default Local Descriptor Table (LDT), usually shared by all processes (see the next section).
Three Thread-Local Storage (TLS)
segments: this is a mechanism that allows multithreaded
applications to make use of up to three segments containing data
local to each thread. The set_thread_area( ) and get_thread_area(
) system calls, respectively, create and release a
TLS segment for the executing process.
Three segments related to Advanced Power Management (APM): the BIOS code makes use of segments, so when the Linux APM driver invokes BIOS functions to get or set the status of APM devices, it may use custom code and data segments.
Five segments related to Plug and Play (PnP) BIOS services. As in the previous case, the BIOS code makes use of segments, so when the Linux PnP driver invokes BIOS functions to detect the resources used by PnP devices, it may use custom code and data segments.
A special TSS segment used by the kernel to handle "Double fault " exceptions (see "Exceptions" in Chapter 4).
As stated earlier, there is a copy of the GDT for each processor in the system. All copies of the GDT store identical entries, except for a few cases. First, each processor has its own TSS segment, thus the corresponding GDT's entries differ. Moreover, a few entries in the GDT may depend on the process that the CPU is executing (LDT and TLS Segment Descriptors). Finally, in some cases a processor may temporarily modify an entry in its copy of the GDT; this happens, for instance, when invoking an APM's BIOS procedure.
Most Linux User Mode applications do not make use of a
Local Descriptor Table, thus the kernel defines a default LDT to be
shared by most processes. The default Local Descriptor Table is stored
in the default_ldt array. It
includes five entries, but only two of them are effectively used by
the kernel: a call gate for iBCS executables, and a call gate for
Solaris /x86 executables (see the section "Execution Domains" in
Chapter 20). Call
gates are a mechanism provided by 80 × 86 microprocessors
to change the privilege level of the CPU while invoking a predefined
function; as we won't discuss them further, you should consult the
Intel documentation for more details.
In some cases, however, processes may require to set up their
own LDT. This turns out to be useful to applications (such as Wine)
that execute segment-oriented Microsoft Windows applications. The modify_ldt(
) system call allows a process to do this.
Any custom LDT created by modify_ldt(
) also requires its own segment. When a processor starts
executing a process having a custom LDT, the LDT entry in the
CPU-specific copy of the GDT is changed accordingly.
User Mode applications also may allocate new segments by means
of modify_ldt( ); the kernel,
however, never makes use of these segments, and it does not have to
keep track of the corresponding Segment Descriptors, because they are
included in the custom LDT of the process.
The paging unit translates linear addresses into physical ones. One key task in the unit is to check the requested access type against the access rights of the linear address. If the memory access is not valid, it generates a Page Fault exception (see Chapter 4 and Chapter 8).
For the sake of efficiency, linear addresses are grouped in fixed-length intervals called pages ; contiguous linear addresses within a page are mapped into contiguous physical addresses. In this way, the kernel can specify the physical address and the access rights of a page instead of those of all the linear addresses included in it. Following the usual convention, we shall use the term "page" to refer both to a set of linear addresses and to the data contained in this group of addresses.
The paging unit thinks of all RAM as partitioned into fixed-length page frames (sometimes referred to as physical pages ). Each page frame contains a page — that is, the length of a page frame coincides with that of a page. A page frame is a constituent of main memory, and hence it is a storage area. It is important to distinguish a page from a page frame; the former is just a block of data, which may be stored in any page frame or on disk.
The data structures that map linear to physical addresses are called page tables ; they are stored in main memory and must be properly initialized by the kernel before enabling the paging unit.
Starting with the 80386, all 80 × 86 processors support paging; it
is enabled by setting the PG flag of
a control register named cr0
. When PG = 0, linear
addresses are interpreted as physical addresses.
Starting with the 80386, the paging unit of Intel processors handles 4 KB pages.
The 32 bits of a linear address are divided into three fields:
Directory
The most significant 10 bits
Table
The intermediate 10 bits
Offset
The least significant 12 bits
The translation of linear addresses is accomplished in two steps, each based on a type of translation table. The first translation table is called the Page Directory, and the second is called the Page Table.[*]
The aim of this two-level scheme is to reduce the amount of RAM required for per-process Page Tables. If a simple one-level Page Table were used, then it would require up to 2²⁰ entries (i.e., at 4 bytes per entry, 4 MB of RAM) to represent the Page Table for each process (if the process used a full 4 GB linear address space), even though a process does not use all addresses in that range. The two-level scheme reduces the memory by requiring Page Tables only for those virtual memory regions actually used by a process.
Each active process must have a Page Directory assigned to it. However, there is no need to allocate RAM for all Page Tables of a process at once; it is more efficient to allocate RAM for a Page Table only when the process effectively needs it.
The physical address of the Page Directory in use is stored in a
control register named cr3
. The Directory field within the linear address
determines the entry in the Page Directory that points to the proper
Page Table. The address's Table field, in turn, determines the entry
in the Page Table that contains the physical address of the page frame
containing the page. The Offset field determines the relative position
within the page frame (see Figure 2-7). Because it is
12 bits long, each page consists of 4096 bytes of data.
Both the Directory and the Table fields are 10 bits long, so Page Directories and Page Tables can include up to 1,024 entries. It follows that a Page Directory can address up to 1024 × 1024 × 4096 = 2³² memory cells, as you'd expect in 32-bit addresses.
The entries of Page Directories and Page Tables have the same structure. Each entry includes the following fields:
Present flag
If it is set, the referred-to page (or Page Table) is
contained in main memory; if the flag is 0, the page is not
contained in main memory and the remaining entry bits may be
used by the operating system for its own purposes. If the entry
of a Page Table or Page Directory needed to perform an address
translation has the Present
flag cleared, the paging unit stores the linear address in a
control register named cr2
and generates exception 14: the Page
Fault exception. (We will see in Chapter 17 how Linux uses
this field.)
Field containing the 20 most significant bits of a page frame physical address
Because each page frame has a 4-KB capacity, its physical address must be a multiple of 4096, so the 12 least significant bits of the physical address are always equal to 0. If the field refers to a Page Directory, the page frame contains a Page Table; if it refers to a Page Table, the page frame contains a page of data.
Accessed flag
Set each time the paging unit addresses the corresponding page frame. This flag may be used by the operating system when selecting pages to be swapped out. The paging unit never resets this flag; this must be done by the operating system.
Dirty flag
Applies only to the Page Table entries. It is set each
time a write operation is performed on the page frame. As with
the Accessed flag, Dirty may be used by the operating
system when selecting pages to be swapped out. The paging unit
never resets this flag; this must be done by the operating
system.
Read/Write flag
Contains the access right (Read/Write or Read) of the page or of the Page Table (see the section "Hardware Protection Scheme" later in this chapter).
User/Supervisor flag
Contains the privilege level required to access the page or Page Table (see the later section "Hardware Protection Scheme").
PCD and PWT flags
Controls the way the page or Page Table is handled by the hardware cache (see the section "Hardware Cache" later in this chapter).
Page Size flag
Applies only to Page Directory entries. If it is set, the entry refers to a 2 MB- or 4 MB-long page frame (see the following sections).
Global flag
Applies only to Page Table entries. This flag was
introduced in the Pentium Pro to prevent frequently used pages
from being flushed from the TLB cache (see the section "Translation Lookaside
Buffers (TLB)" later in this chapter). It works only if
the Page Global Enable (PGE)
flag of register cr4
is set.
Starting with the Pentium model, 80 × 86 microprocessors introduce extended paging , which allows page frames to be 4 MB instead of 4 KB in size (see Figure 2-8). Extended paging is used to translate large contiguous linear address ranges into corresponding physical ones; in these cases, the kernel can do without intermediate Page Tables and thus save memory and preserve TLB entries (see the section "Translation Lookaside Buffers (TLB)").
As mentioned in the previous section, extended paging is enabled
by setting the Page Size flag of a
Page Directory entry. In this case, the paging unit divides the 32
bits of a linear address into two fields:
Directory
The most significant 10 bits
Offset
The remaining 22 bits
Page Directory entries for extended paging are the same as for normal paging, except that:
The Page Size flag must
be set.
Only the 10 most significant bits of the 20-bit physical address field are significant. This is because each physical address is aligned on a 4-MB boundary, so the 22 least significant bits of the address are 0.
Extended paging coexists with regular paging; it is enabled by
setting the PSE flag of the
cr4 processor register.
The paging unit uses a different protection scheme from
the segmentation unit. While 80 × 86 processors allow four possible
privilege levels to a segment, only two privilege levels are
associated with pages and Page Tables, because privileges are
controlled by the User/Supervisor
flag mentioned in the earlier section "Regular Paging." When
this flag is 0, the page can be addressed only when the CPL is less than 3 (this means, for Linux,
when the processor is in Kernel Mode). When the flag is 1, the page
can always be addressed.
Furthermore, instead of the three types of access rights (Read,
Write, and Execute) associated with segments, only two types of access
rights (Read and Write) are associated with pages. If the Read/Write flag of a Page Directory or Page
Table entry is equal to 0, the corresponding Page Table or page can
only be read; otherwise it can be read and written.[*]
A simple example will help in clarifying how regular paging
works. Let's assume that the kernel assigns the linear address space
between 0x20000000 and 0x2003ffff to a running process.[†] This space consists of exactly 64 pages. We don't care
about the physical addresses of the page frames containing the pages;
in fact, some of them might not even be in main memory. We are
interested only in the remaining fields of the Page Table
entries.
Let's start with the 10 most significant bits of the linear
addresses assigned to the process, which are interpreted as the
Directory field by the paging unit. The addresses start with a 2
followed by zeros, so the 10 bits all have the same value, namely
0x080 or 128 decimal. Thus the
Directory field in all the addresses refers to the 129th entry of the
process Page Directory. The corresponding entry must contain the
physical address of the Page Table assigned to the process (see Figure 2-9). If no other
linear addresses are assigned to the process, all the remaining 1,023
entries of the Page Directory are filled with zeros.
The values assumed by the intermediate 10 bits (that is, the values of the Table field) range from 0 to 0x03f, or from 0 to 63 decimal. Thus, only the first 64 entries of the Page Table are valid. The remaining 960 entries are filled with zeros.
Suppose that the process needs to read the byte at linear
address 0x20021406. This address is
handled by the paging unit as follows:
The Directory field 0x80
is used to select entry 0x80 of
the Page Directory, which points to the Page Table associated with
the process's pages.
The Table field 0x21 is
used to select entry 0x21 of
the Page Table, which points to the page frame containing the
desired page.
Finally, the Offset field 0x406 is used to select the byte at
offset 0x406 in the desired
page frame.
If the Present flag of the
0x21 entry of the Page Table is
cleared, the page is not present in main memory; in this case, the
paging unit issues a Page Fault exception while translating the linear address. The
same exception is issued whenever the process attempts to access
linear addresses outside of the interval delimited by 0x20000000 and 0x2003ffff, because the Page Table entries
not assigned to the process are filled with zeros; in particular,
their Present flags are all
cleared.
The amount of RAM supported by a processor is limited by the number of address pins connected to the address bus. Older Intel processors from the 80386 to the Pentium used 32-bit physical addresses. In theory, up to 4 GB of RAM could be installed on such systems; in practice, due to the linear address space requirements of User Mode processes, the kernel cannot directly address more than 1 GB of RAM, as we will see in the later section "Paging in Linux."
However, big servers that need to run hundreds or thousands of processes at the same time require more than 4 GB of RAM, and in recent years this created a pressure on Intel to expand the amount of RAM supported on the 32-bit 80 × 86 architecture.
Intel has satisfied these requests by increasing the number of address pins on its processors from 32 to 36. Starting with the Pentium Pro, all Intel processors are now able to address up to 236 = 64 GB of RAM. However, the increased range of physical addresses can be exploited only by introducing a new paging mechanism that translates 32-bit linear addresses into 36-bit physical ones.
在 Pentium Pro 处理器中,Intel 引入了一种称为物理地址扩展 ( PAE ) 的机制。另一种机制,页面大小扩展(PSE-36),是在Pentium III处理器中引入的,但Linux并没有使用它,我们在本书中不会进一步讨论它。
With the Pentium Pro processor, Intel introduced a mechanism called Physical Address Extension (PAE). Another mechanism, Page Size Extension (PSE-36), was introduced in the Pentium III processor, but Linux does not use it, and we won't discuss it further in this book.
PAE 通过设置 cr4 控制寄存器中的物理地址扩展(PAE)标志来激活。页目录项中的页大小(PS)标志启用大页(启用 PAE 时为 2 MB)。
PAE is activated by setting the Physical Address Extension
(PAE) flag in the cr4 control register. The Page Size
(PS) flag in the page directory
entry enables large page sizes (2 MB when PAE is enabled).
Intel 改变了分页机制以支持 PAE。
Intel has changed the paging mechanism in order to support PAE.
64 GB 的 RAM 被划分为 2^24 个不同的页框,页表项的物理地址字段也从 20 位扩展到了 24 位。由于 PAE 页表项必须包含 12 个标志位(见前面"常规分页"一节)和 24 个物理地址位,总共 36 位,所以页表项的大小从 32 位翻倍到 64 位。因此,一个 4 KB 的 PAE 页表包含 512 个表项,而不是 1,024 个。
The 64 GB of RAM are split into 2^24 distinct page frames, and the physical address field of Page Table entries has been expanded from 20 to 24 bits. Because a PAE Page Table entry must include the 12 flag bits (described in the earlier section "Regular Paging") and the 24 physical address bits, for a grand total of 36, the Page Table entry size has been doubled from 32 bits to 64 bits. As a result, a 4-KB PAE Page Table includes 512 entries instead of 1,024.
引入了新级别的页表,称为页目录指针表 (PDPT),由四个 64 位条目组成。
A new level of Page Table called the Page Directory Pointer Table (PDPT) consisting of four 64-bit entries has been introduced.
cr3 控制寄存器包含一个 27 位的页目录指针表基地址字段。由于 PDPT 存放在 RAM 的前 4 GB 中并对齐到 32 字节(2^5)的倍数,27 位足以表示此类表的基地址。
The cr3 control register contains a 27-bit Page Directory
Pointer Table base address field. Because PDPTs are stored in the
first 4 GB of RAM and aligned to a multiple of 32 bytes
(2^5), 27 bits are sufficient to
represent the base address of such tables.
将线性地址映射到 4 KB 页(页目录项中的 PS 标志被清除)时,线性地址的 32 位按以下方式解释:
cr3:指向一个 PDPT
位 31–30:指向 PDPT 中 4 个可能表项之一
位 29–21:指向页目录中 512 个可能表项之一
位 20–12:指向页表中 512 个可能表项之一
位 11–0:4 KB 页内的偏移量
When mapping linear addresses to 4 KB pages (PS flag cleared in Page Directory
entry), the 32 bits of a linear address are interpreted in the
following way:
cr3: Points to a PDPT
bits 31–30: Point to 1 of 4 possible entries in the PDPT
bits 29–21: Point to 1 of 512 possible entries in the Page Directory
bits 20–12: Point to 1 of 512 possible entries in the Page Table
bits 11–0: Offset within the 4-KB page
将线性地址映射到 2 MB 页(页目录项中的 PS 标志被置位)时,线性地址的 32 位按以下方式解释:
cr3:指向一个 PDPT
位 31–30:指向 PDPT 中 4 个可能表项之一
位 29–21:指向页目录中 512 个可能表项之一
位 20–0:2 MB 页内的偏移量
When mapping linear addresses to 2-MB pages (PS flag set in Page Directory entry),
the 32 bits of a linear address are interpreted in the following
way:
cr3: Points to a PDPT
bits 31–30: Point to 1 of 4 possible entries in the PDPT
bits 29–21: Point to 1 of 512 possible entries in the Page Directory
bits 20–0: Offset within the 2-MB page
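The PAE splits for both page sizes can be sketched in C; the field widths follow from the table sizes above (4-entry PDPT, 512-entry directories and tables, 12- or 21-bit offsets), and the function names are ours, not the kernel's.

```c
#include <stdint.h>

/* PAE interpretation of a 32-bit linear address.
 * 4-KB pages: 2-bit PDPT index, 9-bit Directory, 9-bit Table, 12-bit Offset.
 * 2-MB pages: 2-bit PDPT index, 9-bit Directory, 21-bit Offset. */
static inline uint32_t pae_pdpt_index(uint32_t lin) { return lin >> 30; }           /* 0..3   */
static inline uint32_t pae_dir_index(uint32_t lin)  { return (lin >> 21) & 0x1ff; } /* 0..511 */
static inline uint32_t pae_tbl_index(uint32_t lin)  { return (lin >> 12) & 0x1ff; } /* 0..511 */
static inline uint32_t pae_off_4k(uint32_t lin)     { return lin & 0xfff; }
static inline uint32_t pae_off_2m(uint32_t lin)     { return lin & 0x1fffff; }
```

Note that the PDPT and Directory indexes are extracted identically for both page sizes; only the interpretation of the low-order bits changes with the PS flag.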
总而言之,一旦设置了 cr3,就可以寻址最多 4 GB 的 RAM。如果想寻址更多的 RAM,就必须向 cr3 写入新值,或者修改 PDPT 的内容。然而,PAE 的主要问题在于线性地址仍然是 32 位长。这迫使内核程序员重复使用相同的线性地址来映射 RAM 的不同区域。我们将在后面的"RAM 大小超过 4096 MB 时的最终内核页表"一节中概述启用 PAE 时 Linux 如何初始化页表。显然,PAE 并没有扩大进程的线性地址空间,因为它只处理物理地址。此外,只有内核才能修改进程的页表,因此运行在用户态的进程无法使用大于 4 GB 的物理地址空间。另一方面,PAE 允许内核利用多达 64 GB 的 RAM,从而显著增加系统中的进程数量。
To summarize, once cr3 is
set, it is possible to address up to 4 GB of RAM. If we want to
address more RAM, we'll have to put a new value in cr3 or change the content of the PDPT.
However, the main problem with PAE is that linear addresses are still
32 bits long. This forces kernel programmers to reuse the same linear
addresses to map different areas of RAM. We'll sketch how Linux
initializes Page Tables when PAE is enabled in the later section,
"Final kernel Page Table
when RAM size is more than 4096 MB." Clearly, PAE does not
enlarge the linear address space of a process, because it deals only
with physical addresses. Furthermore, only the kernel can modify the
page tables of the processes, thus a process running in User Mode
cannot use a physical address space larger than 4 GB. On the other
hand, PAE allows the kernel to exploit up to 64 GB of RAM, and thus to
increase significantly the number of processes in the system.
正如前面几节所述,32 位微处理器通常使用两级分页[*]。然而,两级分页并不适合采用 64 位体系结构的计算机。让我们用一个思想实验来解释原因:
As we have seen in the previous sections, two-level paging is commonly used by 32-bit microprocessors[*]. Two-level paging, however, is not suitable for computers that adopt a 64-bit architecture. Let's use a thought experiment to explain why:
首先假设标准页大小为 4 KB。因为 1 KB 覆盖 2^10 个地址的范围,4 KB 覆盖 2^12 个地址,所以 Offset 字段为 12 位。这样,线性地址中最多有 52 位可以在 Table 和 Directory 字段之间分配。如果我们决定只使用 64 位中的 48 位进行寻址(这一限制仍为我们留下宽裕的 256 TB 地址空间!),那么剩下的 48 − 12 = 36 位就必须在 Table 和 Directory 字段之间划分。如果我们再决定为这两个字段各保留 18 位,那么每个进程的页目录和页表都将包含 2^18 个表项,即超过 256,000 个表项。
Start by assuming a standard page size of 4 KB. Because 1 KB
covers a range of 2^10 addresses, 4 KB
covers 2^12 addresses, so the Offset field
is 12 bits. This leaves up to 52 bits of the linear address to be
distributed between the Table and the Directory fields. If we now
decide to use only 48 of the 64 bits for addressing (this restriction
leaves us with a comfortable 256 TB address space!), the remaining
48-12 = 36 bits will have to be
split among Table and the Directory fields. If we now decide to
reserve 18 bits for each of these two fields, both the Page Directory
and the Page Tables of each process should include
2^18 entries — that is, more than 256,000
entries.
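The arithmetic of this thought experiment can be checked mechanically; the helper name is ours, for illustration only.

```c
#include <stdint.h>

/* With 4-KB pages the Offset field takes 12 bits; using addr_bits of
 * the linear address and splitting the rest evenly across `levels`
 * directory/table levels gives this many entries per table. */
static inline uint64_t entries_per_table(int addr_bits, int offset_bits, int levels)
{
    return 1ULL << ((addr_bits - offset_bits) / levels);
}
```

With 48 address bits and two levels, each table would need 2^18 = 262,144 entries; with 32 address bits the same split gives the familiar 1,024-entry tables, which is why two levels suffice there.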
因此,64 位处理器的所有硬件分页系统都使用额外的分页级别。使用的级别数取决于处理器的类型。表 2-4总结了 Linux 支持的一些 64 位平台所使用的硬件分页系统的主要特征。请参阅第 1 章中的“硬件依赖性”部分,了解与平台名称相关的硬件的简短描述。
For that reason, all hardware paging systems for 64-bit processors make use of additional paging levels. The number of levels used depends on the type of processor. Table 2-4 summarizes the main characteristics of the hardware paging systems used by some 64-bit platforms supported by Linux. Please refer to the section "Hardware Dependency" in Chapter 1 for a short description of the hardware associated with the platform name.
表 2-4。某些 64 位体系结构中的分页级别
Table 2-4. Paging levels in some 64-bit architectures
平台名称 Platform name | 页大小 Page size | 使用的地址位数 Number of address bits used | 分页级别数 Number of paging levels | 线性地址划分 Linear address splitting |
|---|---|---|---|---|
^a 该体系结构支持多种页大小;我们选择 Linux 采用的典型页大小。 ^a This architecture supports different page sizes; we select a typical page size adopted by Linux. | | | | |
alpha | 8 KB ^a | 43 | 3 | 10 + 10 + 10 + 13 |
ia64 | 4 KB ^a | 39 | 3 | 9 + 9 + 9 + 12 |
ppc64 | 4 KB | 41 | 3 | 10 + 10 + 9 + 12 |
sh64 | 4 KB | 41 | 3 | 10 + 10 + 9 + 12 |
x86_64 | 4 KB | 48 | 4 | 9 + 9 + 9 + 9 + 12 |
正如我们将在本章后面的“ Linux 中的分页”部分中看到的那样,Linux 成功地提供了适合大多数受支持的硬件分页系统的通用分页模型。
As we will see in the section "Paging in Linux" later in this chapter, Linux succeeds in providing a common paging model that fits most of the supported hardware paging systems.
当今的微处理器的时钟速率为几千兆赫,而动态 RAM (DRAM) 芯片的访问时间在数百个时钟周期范围内。这意味着 CPU 在执行需要从 RAM 获取操作数和/或将结果存储到 RAM 的指令时可能会受到相当大的阻碍。
Today's microprocessors have clock rates of several gigahertz, while dynamic RAM (DRAM) chips have access times in the range of hundreds of clock cycles. This means that the CPU may be held back considerably while executing instructions that require fetching operands from RAM and/or storing results into RAM.
引入硬件高速缓存是为了缩小 CPU 与 RAM 之间的速度差距。它们基于著名的局部性原理,该原理对程序和数据结构都成立:由于程序的循环结构以及相关数据被打包进线性数组,刚刚用过的地址附近的地址在不久的将来很有可能再次被使用。因此,引入一个更小、更快、用来存放最近使用的代码和数据的存储器是有意义的。为此,80 × 86 体系结构中引入了一个称为行(line)的新单位。它由几十个连续的字节组成,这些字节以突发模式在慢速 DRAM 和用于实现缓存的快速片上静态 RAM(SRAM)之间传输。
Hardware cache memories were introduced to reduce the speed mismatch between CPU and RAM. They are based on the well-known locality principle , which holds both for programs and data structures. This states that because of the cyclic structure of programs and the packing of related data into linear arrays, addresses close to the ones most recently used have a high probability of being used in the near future. It therefore makes sense to introduce a smaller and faster memory that contains the most recently used code and data. For this purpose, a new unit called the line was introduced into the 80 × 86 architecture. It consists of a few dozen contiguous bytes that are transferred in burst mode between the slow DRAM and the fast on-chip static RAM (SRAM) used to implement caches.
缓存被细分为行的子集。在一种极端情况下,缓存可以直接映射 ,在这种情况下,主存中的一行始终存储在高速缓存中的完全相同的位置。在另一个极端,缓存是完全关联的 ,这意味着内存中的任何行都可以存储在缓存中的任何位置。但大多数缓存在某种程度上都是 N 路组关联的 ,其中主存的任意一行都可以存储在高速缓存的N行中的任意一行中。例如,一行存储器可以存储在双向组关联高速缓存的两个不同行中。
The cache is subdivided into subsets of lines . At one extreme, the cache can be direct mapped , in which case a line in main memory is always stored at the exact same location in the cache. At the other extreme, the cache is fully associative , meaning that any line in memory can be stored at any location in the cache. But most caches are to some degree N-way set associative , where any line of main memory can be stored in any one of N lines of the cache. For instance, a line of memory can be stored in two different lines of a two-way set associative cache.
如图2-10所示,缓存单元插在分页单元和主存之间。它包括硬件高速缓存和高速缓存控制器。高速缓存存储实际的内存行。高速缓存控制器存储条目数组,高速缓存存储器的每一行一个条目。每个条目都包含一个标签以及一些描述缓存行状态的标志。该标签由一些位组成,这些位允许缓存控制器识别该行当前映射的内存位置。内存物理地址的位通常分为三组:最高有效位对应于标记,中间有效位对应于高速缓存控制器子集索引,最低有效位对应于行内的偏移量。
As shown in Figure 2-10, the cache unit is inserted between the paging unit and the main memory. It includes both a hardware cache memory and a cache controller. The cache memory stores the actual lines of memory. The cache controller stores an array of entries, one entry for each line of the cache memory. Each entry includes a tag and a few flags that describe the status of the cache line. The tag consists of some bits that allow the cache controller to recognize the memory location currently mapped by the line. The bits of the memory's physical address are usually split into three groups: the most significant ones correspond to the tag, the middle ones to the cache controller subset index, and the least significant ones to the offset within the line.
当访问 RAM 存储单元时,CPU 从物理地址中提取子集索引,并将子集中所有行的标记与物理地址的高位进行比较。如果找到与地址高位位相同标记的行,则CPU有缓存命中;否则,它有一个 缓存未命中。
When accessing a RAM memory cell, the CPU extracts the subset index from the physical address and compares the tags of all lines in the subset with the high-order bits of the physical address. If a line with the same tag as the high-order bits of the address is found, the CPU has a cache hit; otherwise, it has a cache miss.
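The three-way address split described above can be sketched in C. The line size (64 bytes) and set count (128) below are assumptions for illustration, not the parameters of any particular CPU, and the helper names are ours.

```c
#include <stdint.h>

/* Splitting a physical address for a set-associative cache:
 * low bits select the byte within the line, middle bits select the
 * subset (set), and the remaining high-order bits form the tag. */
#define LINE_SHIFT 6   /* assumed 64-byte lines */
#define SET_BITS   7   /* assumed 128 sets      */

static inline uint32_t cache_offset(uint32_t pa) { return pa & ((1u << LINE_SHIFT) - 1); }
static inline uint32_t cache_set(uint32_t pa)    { return (pa >> LINE_SHIFT) & ((1u << SET_BITS) - 1); }
static inline uint32_t cache_tag(uint32_t pa)    { return pa >> (LINE_SHIFT + SET_BITS); }
```

Concatenating tag, set index, and offset reconstructs the original physical address, which is why the tag alone is enough for the controller to recognize which memory line a cache line currently holds.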
当发生缓存命中时,缓存控制器的行为依访问类型而不同。对于读操作,控制器从缓存行中选出数据并传送到 CPU 寄存器;不访问 RAM,CPU 节省了时间,这正是缓存系统被发明的原因。对于写操作,控制器可以实现两种基本策略之一,分别称为直写(write-through)和回写(write-back)。直写时,控制器总是同时写入 RAM 和缓存行,实际上等于对写操作关闭了缓存。回写的效率更高:只更新缓存行,RAM 的内容保持不变。当然,回写之后 RAM 最终还是必须更新。只有当 CPU 执行要求刷新缓存表项的指令,或出现 FLUSH 硬件信号时(通常发生在缓存未命中之后),缓存控制器才把缓存行写回 RAM。
When a cache hit occurs, the cache controller behaves differently, depending on the access type. For a read operation, the controller selects the data from the cache line and transfers it into a CPU register; the RAM is not accessed and the CPU saves time, which is why the cache system was invented. For a write operation, the controller may implement one of two basic strategies called write-through and write-back . In a write-through, the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations. In a write-back, which offers more immediate efficiency, only the cache line is updated and the contents of the RAM are left unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).
当发生高速缓存未命中时,如有必要,高速缓存行将被写入内存,并且正确的行将从 RAM 提取到高速缓存条目中。
When a cache miss occurs, the cache line is written to memory, if necessary, and the correct line is fetched from RAM into the cache entry.
多处理器系统的每个处理器都有一个单独的硬件缓存,因此它们需要额外的硬件电路来同步缓存内容。如图2-11所示,每个CPU都有自己的本地硬件缓存。但现在更新变得更加耗时:每当CPU修改其硬件缓存时,它必须检查其他硬件缓存中是否包含相同的数据;如果是这样,它必须通知另一个CPU用正确的值更新它。此活动通常称为缓存监听 。幸运的是,所有这些都是在硬件级别完成的,与内核无关。
Multiprocessor systems have a separate hardware cache for every processor, and therefore they need additional hardware circuitry to synchronize the cache contents. As shown in Figure 2-11, each CPU has its own local hardware cache. But now updating becomes more time consuming: whenever a CPU modifies its hardware cache, it must check whether the same data is contained in the other hardware cache; if so, it must notify the other CPU to update it with the proper value. This activity is often called cache snooping . Luckily, all this is done at the hardware level and is of no concern to the kernel.
缓存技术正在迅速发展。例如,第一个奔腾型号包括一个称为 L1 缓存的片上缓存。最近的模型还包括其他更大、更慢的片上缓存,称为 L2 缓存、L3 缓存等。缓存级别之间的一致性是在硬件级别实现的。Linux 忽略这些硬件细节并假设只有一个缓存。
Cache technology is rapidly evolving. For example, the first Pentium models included a single on-chip cache called the L1-cache. More recent models also include other larger, slower on-chip caches called the L2-cache, L3-cache, etc. The consistency between the cache levels is implemented at the hardware level. Linux ignores these hardware details and assumes there is a single cache.
cr0 处理器寄存器中的 CD 标志用于启用或禁用缓存电路。同一寄存器中的 NW 标志指定缓存采用直写还是回写策略。
The CD flag of the cr0 processor register is used to enable or disable the
cache circuitry. The NW flag, in
the same register, specifies whether the write-through or the
write-back strategy is used for the caches.
奔腾缓存的另一个有趣特性是,它允许操作系统为每个页框关联不同的缓存管理策略。为此,每个页目录项和每个页表项都包含两个标志:PCD(Page Cache Disable),指定访问页框中的数据时必须启用还是禁用缓存;PWT(Page Write-Through),指定向页框写入数据时必须采用回写策略还是直写策略。Linux 清除所有页目录项和页表项的 PCD 和 PWT 标志;因此,所有页框都启用了缓存,并且写入时始终采用回写策略。
Another interesting feature of the Pentium cache is that it lets
an operating system associate a different cache management policy with
each page frame. For this purpose, each Page Directory and each Page
Table entry includes two flags: PCD
(Page Cache Disable), which specifies whether the cache must be
enabled or disabled while accessing data included in the page frame;
and PWT (Page Write-Through), which
specifies whether the write-back or the write-through strategy must be
applied while writing data into the page frame. Linux clears the
PCD and PWT flags of all Page Directory and Page
Table entries; as a result, caching is enabled for all page frames,
and the write-back strategy is always adopted for writing.
除了通用硬件缓存之外,80 × 86 处理器还包括另一个称为转换后备缓冲区( TLB ) 的缓存,以加速线性地址转换。当第一次使用线性地址时,通过慢速访问RAM中的页表来计算相应的物理地址。然后,物理地址被存储在 TLB 条目中,以便可以快速转换对同一线性地址的进一步引用。
Besides general-purpose hardware caches, 80 × 86 processors include another cache called Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to the same linear address can be quickly translated.
在多处理器系统中,每个CPU都有自己的TLB,称为 CPU的本地TLB 。与硬件缓存相反,TLB的相应条目不需要同步,因为在现有CPU上运行的进程可能将相同的线性地址与不同的物理地址相关联。
In a multiprocessor system, each CPU has its own TLB, called the local TLB of the CPU. Contrary to the hardware cache, the corresponding entries of the TLB need not be synchronized, because processes running on the existing CPUs may associate the same linear address with different physical ones.
当 CPU 的 cr3 控制寄存器被修改时,硬件会自动使本地 TLB 的所有表项失效,因为此时使用的是一组新的页表,TLB 指向的是旧数据。
When the cr3 control register of a CPU is modified, the hardware
automatically invalidates all entries of the local TLB, because a new
set of page tables is in use and the TLBs are pointing to old data.
[ * ]在下面的讨论中,小写的“页表”术语表示存储线性地址和物理地址之间的映射的任何页,而大写的“页表”术语表示页表最后一级中的页。
[*] In the discussion that follows, the lowercase "page table" term denotes any page storing the mapping between linear and physical addresses, while the capitalized "Page Table" term denotes a page in the last level of page tables.
[*] 最近的 Intel Pentium 4 处理器在每个 64 位页表项中都有一个 NX(No eXecute)标志(必须启用 PAE,参见本章后面的"物理地址扩展(PAE)分页机制"一节)。Linux 2.6.11 支持这一硬件特性。
[*] Recent Intel Pentium 4 processors sport an NX (No eXecute) flag in each 64-bit Page
Table entry (PAE must be enabled, see the section "The Physical Address
Extension (PAE) Paging Mechanism" later in this chapter).
Linux 2.6.11 supports this hardware feature.
[ † ]正如我们将在接下来的章节中看到的,3 GB 线性地址空间是一个上限,但用户模式进程只允许引用它的一个子集。
[†] As we shall see in the following chapters, the 3 GB linear address space is an upper limit, but a User Mode process is allowed to reference only a subset of it.
[ * ]启用 PAE 的 80 × 86 处理器中引入的第三级分页只是为了将页目录和页表中的条目数从 1024 减少到 512。这将页表条目从 32 位扩大到 64 位,以便它们可以存储物理地址的 24 个最高有效位。
[*] The third level of paging present in 80 × 86 processors with PAE enabled has been introduced only to lower from 1024 to 512 the number of entries in the Page Directory and Page Tables. This enlarges the Page Table entries from 32 bits to 64 bits so that they can store the 24 most significant bits of the physical address.
Linux 采用了一种同时适用于 32 位和 64 位体系结构的通用分页模型。正如前面"64 位体系结构的分页"一节所解释的,对 32 位体系结构来说两级分页就足够了,而 64 位体系结构则需要更多的分页级别。直到 2.6.10 版本,Linux 分页模型由三级分页组成;从 2.6.11 版本开始,采用了四级分页模型。[*] 图 2-12 所示的四种页表分别称为:
Linux adopts a common paging model that fits both 32-bit and 64-bit architectures. As explained in the earlier section "Paging for 64-bit Architectures," two paging levels are sufficient for 32-bit architectures, while 64-bit architectures require a higher number of paging levels. Up to version 2.6.10, the Linux paging model consisted of three paging levels. Starting with version 2.6.11, a four-level paging model has been adopted.[*] The four types of page tables illustrated in Figure 2-12 are called:
页面全球目录
Page Global Directory
页面上层目录
Page Upper Directory
页面中间目录
Page Middle Directory
页表
Page Table
页面全局目录包括多个页面上层目录的地址,页面上层目录又包括多个页面中层目录的地址,页面中层目录又包括多个页表的地址。每个页表条目都指向一个页框。因此,线性地址最多可以分为五个部分。图2-12没有显示位数,因为每个部分的大小取决于计算机体系结构。
The Page Global Directory includes the addresses of several Page Upper Directories, which in turn include the addresses of several Page Middle Directories, which in turn include the addresses of several Page Tables. Each Page Table entry points to a page frame. Thus the linear address can be split into up to five parts. Figure 2-12 does not show the bit numbers, because the size of each part depends on the computer architecture.
对于没有物理地址扩展的 32 位体系结构,两个分页级别就足够了。Linux 本质上消除了页上目录和页中目录字段,因为它们包含零位。然而,页面上层目录和页面中间目录在指针序列中的位置被保留,以便相同的代码可以在32位和64位体系结构上工作。内核通过将页面上层目录和页面中间目录中的条目数设置为1并将这两个条目映射到页面全局目录的正确条目来保留页面上层目录和页面中间目录的位置。
For 32-bit architectures with no Physical Address Extension, two paging levels are sufficient. Linux essentially eliminates the Page Upper Directory and the Page Middle Directory fields by saying that they contain zero bits. However, the positions of the Page Upper Directory and the Page Middle Directory in the sequence of pointers are kept so that the same code can work on 32-bit and 64-bit architectures. The kernel keeps a position for the Page Upper Directory and the Page Middle Directory by setting the number of entries in them to 1 and mapping these two entries into the proper entry of the Page Global Directory.
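A toy model, not kernel code, of how the same four-level walk still works when the Upper and Middle directories are folded down to a single entry each, as just described for the non-PAE 32-bit case; all names and sizes here are ours, for illustration.

```c
/* Folded levels: one entry each, so their index is always 0 and the
 * same walk works for two-level and four-level configurations. */
#define PTRS_PER_PUD 1
#define PTRS_PER_PMD 1

typedef struct { void *next; } entry_t;

/* Walk pgd -> pud -> pmd -> pte and return what the final entry maps. */
static void *walk(entry_t *pgd, unsigned pgd_i, unsigned pte_i)
{
    entry_t *pud = pgd[pgd_i].next;
    entry_t *pmd = pud[0].next;   /* single Page Upper Directory entry  */
    entry_t *pte = pmd[0].next;   /* single Page Middle Directory entry */
    return pte[pte_i].next;
}

/* Build a tiny hierarchy mapping (pgd index 0, pte index 5) to a
 * dummy frame, then check the walk finds it. */
static int demo_walk(void)
{
    static int frame;
    static entry_t pte[8], pmd[PTRS_PER_PMD], pud[PTRS_PER_PUD], pgd[4];
    pte[5].next = &frame;
    pmd[0].next = pte;
    pud[0].next = pmd;
    pgd[0].next = pud;
    return walk(pgd, 0, 5) == (void *)&frame;
}
```

The point of the folding trick is exactly this: generic code indexes all four levels unconditionally, and on architectures where a level has one entry the index contributes nothing.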
对于启用物理地址扩展的 32 位体系结构,使用三个分页级别。Linux的页全局目录对应80×86的页目录指针表,取消页上层目录,页中间目录对应80×86的页目录,Linux的页表对应80×86的页表。
For 32-bit architectures with the Physical Address Extension enabled, three paging levels are used. The Linux's Page Global Directory corresponds to the 80 × 86's Page Directory Pointer Table, the Page Upper Directory is eliminated, the Page Middle Directory corresponds to the 80 × 86's Page Directory, and the Linux's Page Table corresponds to the 80 × 86's Page Table.
最后,对于 64 位体系结构,根据硬件对线性地址位的划分方式,使用三级或四级分页(参见表 2-4)。
Finally, for 64-bit architectures three or four levels of paging are used depending on the linear address bit splitting performed by the hardware (see Table 2-4).
Linux 对进程的处理很大程度上依赖于分页。事实上,线性地址自动转换为物理地址使得以下设计目标变得可行:
Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear addresses into physical ones makes the following design objectives feasible:
为每个进程分配不同的物理地址空间,确保有效防止寻址错误。
Assign a different physical address space to each process, ensuring an efficient protection against addressing errors.
将页(数据组)与页框(主内存中的物理地址)区分开来。这允许将相同的页面存储在页面框架中,然后保存到磁盘并稍后重新加载到不同的页面框架中。这是虚拟内存机制的基本组成部分(参见第17章)。
Distinguish pages (groups of data) from page frames (physical addresses in main memory). This allows the same page to be stored in a page frame, then saved to disk and later reloaded in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 17).
在本章的剩余部分中,为了具体起见,我们将参考 80 × 86 处理器使用的分页电路。
In the remaining part of this chapter, we will refer for the sake of concreteness to the paging circuitry used by the 80 × 86 processors.
正如我们将在第 9 章中看到的,每个进程都有自己的页全局目录和自己的一组页表。当发生进程切换时(参见第 3 章"进程切换"一节),Linux 把 cr3 控制寄存器的内容保存到先前执行进程的描述符中,然后把下一个要执行进程的描述符中保存的值装入 cr3。这样,当新进程在 CPU 上恢复执行时,分页单元就引用了正确的一组页表。
As we will see in Chapter
9, each process has its own Page Global Directory and its own set
of Page Tables. When a process switch occurs (see the section "Process Switch" in Chapter 3), Linux saves the cr3 control register in the descriptor of the process
previously in execution and then loads cr3 with the value stored in the descriptor of
the process to be executed next. Thus, when the new process resumes its
execution on the CPU, the paging unit refers to the correct set of Page
Tables.
将线性映射到物理地址现在成为一项机械任务,尽管它仍然有些复杂。本章接下来的几节是相当乏味的函数和宏列表,它们检索内核查找地址和管理表所需的信息;大多数函数只有一两行长。您现在可能只想浏览一下这些部分,但了解这些函数和宏的作用很有用,因为您会在本书的讨论中经常看到它们。
Mapping linear to physical addresses now becomes a mechanical task, although it is still somewhat complex. The next few sections of this chapter are a rather tedious list of functions and macros that retrieve information the kernel needs to find addresses and manage the tables; most of the functions are one or two lines long. You may want to only skim these sections now, but it is useful to know the role of these functions and macros, because you'll see them often in discussions throughout this book.
以下宏简化了页表的处理:
The following macros simplify Page Table handling:
PAGE_SHIFT
指定 Offset 字段的长度(以位为单位);对 80 × 86 处理器而言,它产生值 12。因为页内的所有地址都必须能放进 Offset 字段,所以 80 × 86 系统上页的大小为 2^12,即熟悉的 4,096 字节;因此,值为 12 的 PAGE_SHIFT 可以看作总页大小以 2 为底的对数。PAGE_SIZE 宏使用它来返回页的大小。最后,PAGE_MASK 宏产生值 0xfffff000,用于屏蔽 Offset 字段的所有位。
Specifies the length in bits of the Offset field; when
applied to 80 × 86 processors, it yields the value 12. Because
all the addresses in a page must fit in the Offset field, the
size of a page on 80 × 86 systems is
2^12 or the familiar 4,096 bytes; the
PAGE_SHIFT of 12 can thus be
considered the logarithm base 2 of the total page size. This
macro is used by PAGE_SIZE to
return the size of the page. Finally, the PAGE_MASK macro yields the value
0xfffff000 and is used to
mask all the bits of the Offset field.
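For the 80 × 86 case just described, the three macros can be sketched as follows (values taken from the text; this is an illustration, not the kernel's actual architecture-dependent definitions):

```c
/* Page-size macros for the 80 x 86, non-PAE case. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)   /* 4,096 bytes                  */
#define PAGE_MASK  (~(PAGE_SIZE - 1))    /* 0xfffff000 on a 32-bit long  */
```

Applying PAGE_MASK to a linear address discards the Offset field, yielding the address of the start of the page.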
PMD_SHIFT
线性地址的 Offset 字段和 Table 字段的总长度(以位为单位);换句话说,即页中间目录的一个表项所能映射区域大小的对数。PMD_SIZE 宏计算页中间目录单个表项(也就是一个页表)所映射区域的大小。PMD_MASK 宏用于屏蔽 Offset 和 Table 字段的所有位。
当禁用 PAE 时,PMD_SHIFT 产生值 22(Offset 的 12 加 Table 的 10),PMD_SIZE 产生 2^22 即 4 MB,PMD_MASK 产生 0xffc00000。相反,当启用 PAE 时,PMD_SHIFT 产生值 21(Offset 的 12 加 Table 的 9),PMD_SIZE 产生 2^21 即 2 MB,PMD_MASK 产生 0xffe00000。
大页不使用最后一级页表,因此产生大页大小的 LARGE_PAGE_SIZE 等于 PMD_SIZE(即 2^PMD_SHIFT),而用于屏蔽大页地址中 Offset 和 Table 字段所有位的 LARGE_PAGE_MASK 等于 PMD_MASK。
The total length in bits of the Offset and Table fields of
a linear address; in other words, the logarithm of the size of
the area a Page Middle Directory entry can map. The PMD_SIZE macro computes the size of
the area mapped by a single entry of the Page Middle Directory —
that is, of a Page Table. The PMD_MASK macro is used to mask all the
bits of the Offset and Table fields.
When PAE is disabled, PMD_SHIFT yields the value 22 (12 from
Offset plus 10 from Table), PMD_SIZE yields
2^22 or 4 MB, and PMD_MASK yields 0xffc00000. Conversely, when PAE is
enabled, PMD_SHIFT yields the
value 21 (12 from Offset plus 9 from Table), PMD_SIZE yields
2^21 or 2 MB, and PMD_MASK yields 0xffe00000.
Large pages do not make use of the last level of page
tables, thus LARGE_PAGE_SIZE,
which yields the size of a large page, is equal to PMD_SIZE (2^PMD_SHIFT) while LARGE_PAGE_MASK, which is used to mask
all the bits of the Offset and Table fields in a large page
address, is equal to PMD_MASK.
PUD_SHIFT
确定页上层目录的一个表项所能映射区域大小的对数。PUD_SIZE 宏计算页上层目录单个表项所映射区域的大小。PUD_MASK 宏用于屏蔽 Offset、Table、Middle Air 和 Upper Air 字段的所有位。
在 80 × 86 处理器上,PUD_SHIFT 始终等于 PMD_SHIFT,PUD_SIZE 等于 4 MB 或 2 MB。
Determines the logarithm of the size of the area a Page
Upper Directory entry can map. The PUD_SIZE macro computes the size of
the area mapped by a single entry of the Page Upper Directory.
The PUD_MASK macro is used to
mask all the bits of the Offset, Table, Middle Air, and Upper
Air fields.
On the 80 × 86 processors, PUD_SHIFT is always equal to PMD_SHIFT and PUD_SIZE is equal to 4 MB or 2
MB.
PGDIR_SHIFT
确定页全局目录的一个表项所能映射区域大小的对数。PGDIR_SIZE 宏计算页全局目录单个表项所映射区域的大小。PGDIR_MASK 宏用于屏蔽 Offset、Table、Middle Air 和 Upper Air 字段的所有位。
当禁用 PAE 时,PGDIR_SHIFT 产生值 22(与 PMD_SHIFT 和 PUD_SHIFT 产生的值相同),PGDIR_SIZE 产生 2^22 即 4 MB,PGDIR_MASK 产生 0xffc00000。相反,当启用 PAE 时,PGDIR_SHIFT 产生值 30(Offset 的 12 加 Table 的 9 加 Middle Air 的 9),PGDIR_SIZE 产生 2^30 即 1 GB,PGDIR_MASK 产生 0xc0000000。
Determines the logarithm of the size of the area that a
Page Global Directory entry can map. The PGDIR_SIZE macro computes the size of
the area mapped by a single entry of the Page Global Directory.
The PGDIR_MASK macro is used
to mask all the bits of the Offset, Table, Middle Air, and Upper
Air fields.
When PAE is disabled, PGDIR_SHIFT yields the value 22 (the
same value yielded by PMD_SHIFT and by PUD_SHIFT), PGDIR_SIZE yields
2^22 or 4 MB, and PGDIR_MASK yields 0xffc00000. Conversely, when PAE is
enabled, PGDIR_SHIFT yields
the value 30 (12 from Offset plus 9 from Table plus 9 from
Middle Air), PGDIR_SIZE
yields 2^30 or 1 GB, and PGDIR_MASK yields 0xc0000000.
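The shift arithmetic above can be checked mechanically; this is a sketch under the assumption that the field widths are as given in the text, and the helper names are ours, not the kernel's.

```c
#include <stdint.h>

/* Derive the PMD/PGDIR constants from the field widths:
 * non-PAE: 12-bit Offset + 10-bit Table (+ 10-bit Directory);
 * PAE:     12-bit Offset + 9-bit Table + 9-bit Middle + 2-bit upper index. */
static inline unsigned pmd_shift(int pae)   { return pae ? 12 + 9     : 12 + 10; }
static inline unsigned pgdir_shift(int pae) { return pae ? 12 + 9 + 9 : 12 + 10; }

/* A mask covering every bit below the given shift, inverted. */
static inline uint32_t mask_from_shift(unsigned s) { return ~((1u << s) - 1); }
```

The masks the text quotes (0xffc00000, 0xffe00000, 0xc0000000) all fall out of mask_from_shift applied to the corresponding shift.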
PTRS_PER_PTE、PTRS_PER_PMD、PTRS_PER_PUD 和 PTRS_PER_PGD
计算页表、页中间目录、页上层目录和页全局目录中的表项数。当禁用 PAE 时,它们分别产生值 1,024、1、1 和 1,024;当启用 PAE 时,分别为 512、512、1 和 4。
Compute the number of entries in the Page Table, Page Middle Directory, Page Upper Directory, and Page Global Directory. They yield the values 1,024, 1, 1, and 1,024, respectively, when PAE is disabled; and the values 512, 512, 1, and 4, respectively, when PAE is enabled.
pte_t、pmd_t、pud_t 和 pgd_t 分别描述页表项、页中间目录项、页上层目录项和页全局目录项的格式。启用 PAE 时它们是 64 位数据类型,否则是 32 位数据类型。pgprot_t 是另一种 64 位(启用 PAE)或 32 位(禁用 PAE)的数据类型,表示与单个表项关联的保护标志。
pte_t, pmd_t, pud_t, and pgd_t describe the format of, respectively,
a Page Table, a Page Middle Directory, a Page Upper Directory, and a
Page Global Directory entry. They are 64-bit data types when PAE is
enabled and 32-bit data types otherwise. pgprot_t is another 64-bit (PAE enabled) or
32-bit (PAE disabled) data type that represents the protection flags
associated with a single entry.
五个类型转换宏 — __pte、__pmd、__pud、__pgd 和 __pgprot — 把一个无符号整数转换成所需的类型。另外五个类型转换宏 — pte_val、pmd_val、pud_val、pgd_val 和 pgprot_val — 执行相反的转换,把上述五种专用类型之一转换回无符号整数。
Five type-conversion macros — __pte,
__pmd, __pud,
__pgd, and __pgprot —
cast an unsigned integer into the required type. Five other
type-conversion macros — pte_val,
pmd_val, pud_val, pgd_val, and pgprot_val — perform the reverse casting
from one of the five previously mentioned specialized types into an
unsigned integer.
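A sketch of how such wrapper types are typically built, shown for the 32-bit, PAE-disabled case and spelling out only the __pte/__pgprot pair; wrapping the value in a single-member struct lets the compiler catch accidental mixing of entry types.

```c
/* Single-member struct wrappers for page-table entry values
 * (non-PAE sketch: entries fit in an unsigned long). */
typedef struct { unsigned long pte; }    pte_t;
typedef struct { unsigned long pgprot; } pgprot_t;

/* Cast an unsigned integer into the wrapped type... */
#define __pte(x)      ((pte_t) { (x) })
#define __pgprot(x)   ((pgprot_t) { (x) })
/* ...and back again. */
#define pte_val(x)    ((x).pte)
#define pgprot_val(x) ((x).pgprot)
```

With these definitions, passing a pgprot_t where a pte_t is expected is a compile-time type error, while pte_val/pgprot_val recover the raw integer when the bits themselves are needed.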
内核还提供了几个宏和函数来读取或修改页表条目:
The kernel also provides several macros and functions to read or modify page table entries:
pte_none、pmd_none、pud_none 和 pgd_none:如果对应的表项值为 0,则产生值 1;否则产生值 0。
pte_none, pmd_none, pud_none, and pgd_none yield the value 1 if the
corresponding entry has the value 0; otherwise, they yield the
value 0.
pte_clear、pmd_clear、pud_clear 和 pgd_clear:清除对应页表的一个表项,从而禁止进程使用该页表项所映射的线性地址。ptep_get_and_clear() 函数清除一个页表项并返回其先前的值。
pte_clear, pmd_clear, pud_clear, and pgd_clear clear an entry of the
corresponding page table, thus forbidding a process to use the
linear addresses mapped by the page table entry. The ptep_get_and_clear( ) function clears a
Page Table entry and returns the previous value.
set_pte、set_pmd、set_pud 和 set_pgd:把给定的值写入一个页表项;set_pte_atomic 与 set_pte 相同,但在启用 PAE 时还保证 64 位的值被原子地写入。
set_pte, set_pmd, set_pud, and set_pgd write a given value into a page
table entry; set_pte_atomic is
identical to set_pte, but when
PAE is enabled it also ensures that the 64-bit value is written
atomically.
pte_same(a,b):如果两个页表项 a 和 b 指向同一页并且指定相同的访问权限,则返回 1,否则返回 0。
pte_same(a,b) returns 1
if two Page Table entries a and
b refer to the same page and
specify the same access privileges, 0 otherwise.
pmd_large(e):如果页中间目录项 e 指向一个大页(2 MB 或 4 MB),则返回 1,否则返回 0。
pmd_large(e) returns 1 if
the Page Middle Directory entry e refers to a large page (2 MB or 4 MB),
0 otherwise.
一些函数使用 pmd_bad 宏来检查作为输入参数传递的页中间目录项。如果表项指向一个损坏的(bad)页表,即至少满足下列条件之一,它就产生值 1:
The pmd_bad macro is used by
functions to check Page Middle Directory entries passed as input
parameters. It yields the value 1 if the entry points to a bad Page
Table — that is, if at least one of the following conditions
applies:
该页不在主内存中(Present标志已清除)。
The page is not in main memory (Present flag cleared).
该页面仅允许读取访问(Read/Write标志被清除)。
The page allows only Read access (Read/Write flag cleared).
Accessed 或 Dirty 标志被清除(对每个现存的页表,Linux 总是强制设置这些标志)。
Either Accessed or
Dirty is cleared (Linux always
forces these flags to be set for every existing Page
Table).
pud_bad 和 pgd_bad 宏总是产生 0。没有定义 pte_bad 宏,因为页表项完全可以合法地指向一个不在主存中、不可写或根本不可访问的页。
The pud_bad and pgd_bad macros always yield 0. No pte_bad macro is defined, because it is
legal for a Page Table entry to refer to a page that is not present in
main memory, not writable, or not accessible at all.
如果页表项的 Present 标志或 Page Size 标志等于 1,pte_present 宏就产生值 1,否则产生值 0。回想一下,页表项中的 Page Size 标志对微处理器的分页单元没有意义;然而,对于存在于主存中但没有读、写或执行权限的页,内核将其 Present 标志置为 0 而 Page Size 标志置为 1。这样,对此类页的任何访问都会触发缺页异常(因为 Present 被清除),而内核可以通过检查 Page Size 的值来判定该异常并不是因为页不存在引起的。
The pte_present macro yields
the value 1 if either the Present
flag or the Page Size flag of a
Page Table entry is equal to 1, the value 0 otherwise. Recall that the
Page Size flag in Page Table
entries has no meaning for the paging unit of the microprocessor; the
kernel, however, marks Present
equal to 0 and Page Size equal to 1
for the pages present in main memory but without read, write, or
execute privileges. In this way, any access to such pages triggers a
Page Fault exception because Present is cleared, and the kernel can
detect that the fault is not due to a missing page by checking the
value of Page Size.
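A sketch of the trick, using the 80 × 86 bit positions for these flags (Present is bit 0 and Page Size is bit 7 of an entry); the function name is ours, chosen to avoid clashing with the kernel's pte_present.

```c
/* 80 x 86 page-table entry flag bits used by the trick. */
#define _PAGE_PRESENT 0x001UL   /* Present flag, bit 0   */
#define _PAGE_PSE     0x080UL   /* Page Size flag, bit 7 */

/* An entry counts as "present" to the kernel if either flag is set:
 * Present=0, Page Size=1 marks a page that is in RAM but has no
 * read/write/execute privileges. */
static inline int my_pte_present(unsigned long pte)
{
    return (pte & (_PAGE_PRESENT | _PAGE_PSE)) != 0;
}
```

The hardware only looks at the Present bit, so a Present=0/Page Size=1 entry still faults on access; the kernel's fault handler then uses the Page Size bit to tell "page protected" apart from "page missing".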
如果对应表项的 Present 标志等于 1,也就是说,如果对应的页或页表已装入主存,pmd_present 宏就产生值 1。pud_present 和 pgd_present 宏总是产生值 1。
The pmd_present macro yields
the value 1 if the Present flag of
the corresponding entry is equal to 1 — that is, if the corresponding
page or Page Table is loaded in main memory. The pud_present and pgd_present macros always yield the value
1.
表 2-5 中列出的函数查询页表项中各个标志的当前值;除 pte_file() 外,这些函数只有在 pte_present 返回 1 的页表项上才能正常工作。
The functions listed in Table 2-5 query the
current value of any of the flags included in a Page Table entry; with
the exception of pte_file(), these
functions work properly only on Page Table entries for which pte_present returns 1.
表 2-5。页标志读取函数
Table 2-5. Page flag reading functions
函数名称 Function name | 描述 Description |
|---|---|
| pte_user() | 读取 User/Supervisor 标志 Reads the User/Supervisor flag |
| pte_read() | 读取 User/Supervisor 标志(80 × 86 的页无法禁止读取) Reads the User/Supervisor flag (pages on the 80 × 86 cannot be protected against reading) |
| pte_write() | 读取 Read/Write 标志 Reads the Read/Write flag |
| pte_exec() | 读取 User/Supervisor 标志(80 × 86 的页无法禁止执行代码) Reads the User/Supervisor flag (pages on the 80 × 86 cannot be protected against code execution) |
| pte_dirty() | 读取 Dirty 标志 Reads the Dirty flag |
| pte_young() | 读取 Accessed 标志 Reads the Accessed flag |
| pte_file() | 读取 Dirty 标志(该标志置位时,页属于非线性磁盘文件映射) Reads the Dirty flag (when set, the page belongs to a non-linear disk file mapping) |
表 2-6中列出的另一组函数设置页表条目中的标志值。
Another group of functions listed in Table 2-6 sets the value of the flags in a Page Table entry.
Table 2-6. Page flag setting functions
| Function name | Description |
|---|---|
| mk_pte_huge() | Sets the Page Size and Present flags of a Page Table entry |
| pte_wrprotect() | Clears the Read/Write flag |
| pte_rdprotect() | Clears the User/Supervisor flag |
| pte_exprotect() | Clears the User/Supervisor flag |
| pte_mkwrite() | Sets the Read/Write flag |
| pte_mkread() | Sets the User/Supervisor flag |
| pte_mkexec() | Sets the User/Supervisor flag |
| pte_mkclean() | Clears the Dirty flag |
| pte_mkdirty() | Sets the Dirty flag |
| pte_mkold() | Clears the Accessed flag (makes the page old) |
| pte_mkyoung() | Sets the Accessed flag (makes the page young) |
| pte_modify(p,v) | Sets all access rights in a Page Table entry to a specified value |
| ptep_set_wrprotect() | Like pte_wrprotect(), but acts on a pointer to a Page Table entry |
| ptep_set_access_flags() | If the Dirty flag is set, sets the page's access rights to a specified value and invokes flush_tlb_page() |
| ptep_mkdirty() | Like pte_mkdirty(), but acts on a pointer to a Page Table entry |
| ptep_test_and_clear_dirty() | Like pte_mkclean(), but acts on a pointer to a Page Table entry; moreover, it returns the old value of the flag |
| ptep_test_and_clear_young() | Like pte_mkold(), but acts on a pointer to a Page Table entry; moreover, it returns the old value of the flag |
Now, let's discuss the macros listed in Table 2-7 that combine a page address and a group of protection flags into a page table entry or perform the reverse operation of extracting the page address from a page table entry. Notice that some of these macros refer to a page through the linear address of its "page descriptor" (see the section "Page Descriptors" in Chapter 8) rather than the linear address of the page itself.
Table 2-7. Macros acting on Page Table entries
| Macro name | Description |
|---|---|
| pgd_index(addr) | Yields the index (relative position) of the entry in the Page Global Directory that maps the linear address addr |
| pgd_offset(mm, addr) | Receives as parameters the address of a memory descriptor mm and a linear address addr; it yields the linear address of the entry in a Page Global Directory that corresponds to addr, the Page Global Directory being found through a pointer within the memory descriptor |
| pgd_offset_k(addr) | Yields the linear address of the entry in the master kernel Page Global Directory that corresponds to the address addr (see the later section "Kernel Page Tables") |
| pgd_page(pgd) | Yields the page descriptor address of the page frame containing the Page Upper Directory referred to by the Page Global Directory entry pgd |
| pud_offset(pgd, addr) | Receives as parameters a pointer pgd to a Page Global Directory entry and a linear address addr; it yields the linear address of the entry in a Page Upper Directory that corresponds to addr (in a two- or three-level paging system, it yields pgd itself) |
| pud_page(pud) | Yields the linear address of the Page Middle Directory referred to by the Page Upper Directory entry pud |
| pmd_index(addr) | Yields the index (relative position) of the entry in the Page Middle Directory that maps the linear address addr |
| pmd_offset(pud, addr) | Receives as parameters a pointer pud to a Page Upper Directory entry and a linear address addr; it yields the linear address of the Page Middle Directory entry that corresponds to addr |
| pmd_page(pmd) | Yields the page descriptor address of the Page Table referred to by the Page Middle Directory entry pmd |
| mk_pte(p, prot) | Receives as parameters the address of a page descriptor p and a group of access rights prot, and builds the corresponding Page Table entry |
| pte_index(addr) | Yields the index (relative position) of the entry in the Page Table that maps the linear address addr |
| pte_offset_kernel(dir, addr) | Yields the linear address of the Page Table entry that corresponds to the linear address addr mapped by the Page Middle Directory entry dir; used only on the master kernel page tables |
| pte_offset_map(dir, addr) | Receives as parameters a pointer dir to a Page Middle Directory entry and a linear address addr; it yields the linear address of the entry in the Page Table that corresponds to addr (if the Page Table is kept in high memory, a temporary kernel mapping is established, to be released by pte_unmap) |
| pte_page(x) | Returns the page descriptor address of the page referenced by the Page Table entry x |
| pte_to_pgoff(pte) | Extracts from the content pte of a Page Table entry the file offset corresponding to a page belonging to a non-linear file memory mapping |
| pgoff_to_pte(offset) | Sets up the content of a Page Table entry for a page belonging to a non-linear file memory mapping |
The last group of functions of this long list was introduced to simplify the creation and deletion of page table entries.
When two-level paging is used, creating or deleting a Page Middle Directory entry is trivial. As we explained earlier in this section, the Page Middle Directory contains a single entry that points to the subordinate Page Table. Thus, the Page Middle Directory entry is the entry within the Page Global Directory, too. When dealing with Page Tables, however, creating an entry may be more complex, because the Page Table that is supposed to contain it might not exist. In such cases, it is necessary to allocate a new page frame, fill it with zeros, and add the entry.
If PAE is enabled, the kernel uses three-level paging. When the kernel creates a new Page Global Directory, it also allocates the four corresponding Page Middle Directories; these are freed only when the parent Page Global Directory is released.
When two- or three-level paging is used, the Page Upper Directory entry is always mapped as a single entry within the Page Global Directory.
As usual, the description of the functions listed in Table 2-8 refers to the 80 × 86 architecture.
Table 2-8. Page allocation functions
| Function name | Description |
|---|---|
| pgd_alloc(mm) | Allocates a new Page Global Directory; if PAE is enabled, it also allocates the three children Page Middle Directories that map the User Mode linear addresses. The argument mm (the address of a memory descriptor) is ignored on the 80×86 architecture |
| pgd_free(pgd) | Releases the Page Global Directory at address pgd; if PAE is enabled, it also releases the three Page Middle Directories that map the User Mode linear addresses |
| pud_alloc(mm, pgd, addr) | In a two- or three-level paging system, this function does nothing: it simply returns the linear address of the Page Global Directory entry pgd |
| pud_free(x) | In a two- or three-level paging system, this macro does nothing |
| pmd_alloc(mm, pud, addr) | Defined so generic three-level paging systems can allocate a new Page Middle Directory for the linear address addr; if PAE is not enabled, it simply returns the input parameter pud, that is, the address of the entry in the Page Global Directory |
| pmd_free(x) | Does nothing, because Page Middle Directories are allocated and deallocated together with their parent Page Global Directory |
| pte_alloc_map(mm, pmd, addr) | Receives as parameters the address of a Page Middle Directory entry pmd and a linear address addr, and returns the address of the Page Table entry corresponding to addr; if the Page Middle Directory entry is null, it allocates a new Page Table |
| pte_alloc_kernel(mm, pmd, addr) | If the Page Middle Directory entry pmd associated with the address addr is null, allocates a new Page Table; it then returns the linear address of the Page Table entry associated with addr (used only for master kernel page tables) |
| pte_free(pte) | Releases the Page Table associated with the pte page descriptor pointer |
| pte_free_kernel(pte) | Equivalent to pte_free(), but used for master kernel page tables |
| clear_page_range(mmu, start, end) | Clears the contents of the page tables of a process from linear address start to end by iteratively releasing its Page Tables and clearing the Page Middle Directory entries |
During the initialization phase the kernel must build a physical addresses map that specifies which physical address ranges are usable by the kernel and which are unavailable (either because they map hardware devices' I/O shared memory or because the corresponding page frames contain BIOS data).
The kernel considers the following page frames as reserved:
Those falling in the unavailable physical address ranges
Those containing the kernel's code and initialized data structures
A page contained in a reserved page frame can never be dynamically assigned or swapped to disk.
As a general rule, the Linux kernel is installed in RAM starting
from the physical address 0x00100000 — i.e., from the second megabyte.
The total number of page frames required depends on how the kernel is
configured. A typical configuration yields a kernel that can be loaded
in less than 3 MB of RAM.
Why isn't the kernel loaded starting with the first available megabyte of RAM? Well, the PC architecture has several peculiarities that must be taken into account. For example:
Page frame 0 is used by BIOS to store the system hardware configuration detected during the Power-On Self-Test (POST); the BIOS of many laptops, moreover, writes data on this page frame even after the system is initialized.
Physical addresses ranging from 0x000a0000 to 0x000fffff are usually reserved to BIOS
routines and to map the internal memory of ISA graphics cards.
This area is the well-known hole from 640 KB to 1 MB in all
IBM-compatible PCs: the physical addresses exist but they are
reserved, and the corresponding page frames cannot be used by the
operating system.
Additional page frames within the first megabyte may be
reserved by specific computer models. For example, the IBM
ThinkPad maps the 0xa0 page
frame into the 0x9f one.
In the early stage of the boot sequence (see Appendix A), the kernel queries the BIOS and learns the size of the physical memory. In recent computers, the kernel also invokes a BIOS procedure to build a list of physical address ranges and their corresponding memory types.
Later, the kernel executes the machine_specific_memory_setup( ) function,
which builds the physical addresses map (see Table 2-9 for an example).
Of course, the kernel builds this table on the basis of the BIOS list,
if this is available; otherwise the kernel builds the table following
the conservative default setup: all page frames with numbers from
0x9f (LOWMEMSIZE( )) to 0x100 (HIGH_MEMORY) are marked as reserved.
Table 2-9. Example of BIOS-provided physical addresses map
| Start | End | Type |
|---|---|---|
| 0x00000000 | 0x0009ffff | Usable |
| 0x000f0000 | 0x000fffff | Reserved |
| 0x00100000 | 0x07feffff | Usable |
| 0x07ff0000 | 0x07ff2fff | ACPI data |
| 0x07ff3000 | 0x07ffffff | ACPI NVS |
| 0xffff0000 | 0xffffffff | Reserved |
A typical configuration for a computer having 128 MB of RAM is
shown in Table 2-9.
The physical address range from 0x07ff0000 to 0x07ff2fff stores information about the
hardware devices of the system written by the BIOS in the POST phase;
during the initialization phase, the kernel copies such information in
a suitable kernel data structure, and then considers these page frames
usable. Conversely, the physical address range of 0x07ff3000 to 0x07ffffff is mapped to ROM chips of the
hardware devices. The physical address range starting from 0xffff0000 is marked as reserved, because it
is mapped by the hardware to the BIOS's ROM chip (see Appendix A).
Notice that the BIOS may not provide information for some physical
address ranges (in the table, the range is 0x000a0000 to 0x000effff). To be on the safe side, Linux
assumes that such ranges are not usable.
The kernel might not see all physical memory reported by the
BIOS: for instance, the kernel can address only 4 GB of RAM if it has
not been compiled with PAE support, even if a larger amount of
physical memory is actually available. The setup_memory( ) function is invoked right
after machine_specific_memory_setup(
): it analyzes the table of physical memory regions and
initializes a few variables that describe the kernel's physical memory
layout. These variables are shown in Table 2-10.
Table 2-10. Variables describing the kernel's physical memory layout
| Variable name | Description |
|---|---|
| num_physpages | Page frame number of the highest usable page frame |
| totalram_pages | Total number of usable page frames |
| min_low_pfn | Page frame number of the first usable page frame after the kernel image in RAM |
| max_pfn | Page frame number of the last usable page frame |
| max_low_pfn | Page frame number of the last page frame directly mapped by the kernel (low memory) |
| totalhigh_pages | Total number of page frames not directly mapped by the kernel (high memory) |
| highstart_pfn | Page frame number of the first page frame not directly mapped by the kernel |
| highend_pfn | Page frame number of the last page frame not directly mapped by the kernel |
To avoid loading the kernel into groups of noncontiguous page frames, Linux prefers to skip the first megabyte of RAM. Clearly, page frames not reserved by the PC architecture will be used by Linux to store dynamically assigned pages.
Figure 2-13 shows how the first 3 MB of RAM are filled by Linux. We have assumed that the kernel requires less than 3 MB of RAM.
The symbol _text, which
corresponds to physical address 0x00100000, denotes the address of the first
byte of kernel code. The end of the kernel code is similarly
identified by the symbol _etext.
Kernel data is divided into two groups:
initialized and
uninitialized. The initialized data starts right
after _etext and ends at _edata. The uninitialized data follows and
ends up at _end.
The symbols appearing in the figure are not defined in Linux source code; they are produced while compiling the kernel.[*]
The linear address space of a process is divided into two parts:
Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the
process runs in either User or Kernel Mode.
Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when
the process runs in Kernel Mode.
When a process runs in User Mode, it issues linear addresses
smaller than 0xc0000000; when it
runs in Kernel Mode, it is executing kernel code and the linear
addresses issued are greater than or equal to 0xc0000000. In some cases, however, the
kernel must access the User Mode linear address space to retrieve or
store data.
The PAGE_OFFSET macro yields
the value 0xc0000000; this is the
offset in the linear address space of a process where the kernel
lives. In this book, we often refer directly to the number 0xc0000000 instead.
The content of the first entries of the Page Global Directory
that map linear addresses lower than 0xc0000000 (the first 768 entries with PAE
disabled, or the first 3 entries with PAE enabled) depends on the
specific process. Conversely, the remaining entries should be the same
for all processes and equal to the corresponding entries of the master
kernel Page Global Directory (see the following section).
The kernel maintains a set of page tables for its own use, rooted at a so-called master kernel Page Global Directory. After system initialization, this set of page tables is never directly used by any process or kernel thread; rather, the highest entries of the master kernel Page Global Directory are the reference model for the corresponding entries of the Page Global Directories of every regular process in the system.
We explain how the kernel ensures that changes to the master kernel Page Global Directory are propagated to the Page Global Directories that are actually used by processes in the section "Linear Addresses of Noncontiguous Memory Areas" in Chapter 8.
We now describe how the kernel initializes its own page tables. This is a two-phase activity. In fact, right after the kernel image is loaded into memory, the CPU is still running in real mode; thus, paging is not enabled.
In the first phase, the kernel creates a limited address space including the kernel's code and data segments, the initial Page Tables, and 128 KB for some dynamic data structures. This minimal address space is just large enough to install the kernel in RAM and to initialize its core data structures.
In the second phase, the kernel takes advantage of all of the existing RAM and sets up the page tables properly. Let us examine how this plan is executed.
A provisional Page Global Directory is initialized statically during kernel compilation,
while the provisional Page Tables are initialized by the startup_32( ) assembly language function
defined in arch/i386/kernel/head.S . We won't bother mentioning the Page Upper
Directories and Page Middle Directories anymore, because they are
equated to Page Global Directory entries. PAE support is not enabled
at this stage.
The provisional Page Global Directory is contained in the
swapper_pg_dir variable. The
provisional Page Tables are stored starting from pg0, right after the end of the kernel's
uninitialized data segments (symbol _end in Figure 2-13). For the
sake of simplicity, let's assume that the kernel's segments, the
provisional Page Tables, and the 128 KB memory area fit in the first
8 MB of RAM. In order to map 8 MB of RAM, two Page Tables are
required.
The objective of this first phase of paging is to allow these
8 MB of RAM to be easily addressed both in real mode and protected
mode. Therefore, the kernel must create a mapping from both the
linear addresses 0x00000000
through 0x007fffff and the linear
addresses 0xc0000000 through
0xc07fffff into the physical
addresses 0x00000000 through
0x007fffff. In other words, the
kernel during its first phase of initialization can address the
first 8 MB of RAM by either linear addresses identical to the
physical ones or 8 MB worth of linear addresses, starting from
0xc0000000.
The kernel creates the desired mapping by filling all the
swapper_pg_dir entries with
zeroes, except for entries 0, 1, 0x300 (decimal 768), and 0x301 (decimal 769); the latter two
entries span all linear addresses between 0xc0000000 and 0xc07fffff. The 0, 1, 0x300, and 0x301 entries are initialized as
follows:
The address field of entries 0 and 0x300 is set to the physical address
of pg0, while the address
field of entries 1 and 0x301
is set to the physical address of the page frame following
pg0.
The Present, Read/Write, and User/Supervisor flags are set in all
four entries.
The Accessed, Dirty, PCD, PWT, and Page
Size flags are cleared in all four entries.
The startup_32( ) assembly
language function also enables the paging unit. This is achieved by
loading the physical address of swapper_pg_dir into the cr3 control register and by setting the PG flag of the cr0 control register, as shown in the following
equivalent code fragment:
movl $swapper_pg_dir-0xc0000000,%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
orl $0x80000000,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
The final mapping provided by the kernel page tables
must transform linear addresses starting from 0xc0000000 into physical addresses
starting from 0.
The _ _pa macro is used to
convert a linear address starting from PAGE_OFFSET to the corresponding physical
address, while the _ _va macro
does the reverse.
The master kernel Page Global Directory is still stored in swapper_pg_dir. It is initialized by the
paging_init( ) function, which
does the following:
Invokes pagetable_init(
) to set up the Page Table entries properly.
Writes the physical address of swapper_pg_dir in the cr3 control register.
If the CPU supports PAE and if the kernel is compiled with PAE support,
sets the PAE flag in the
cr4 control register.
Invokes _ _flush_tlb_all(
) to invalidate all TLB entries.
The actions performed by pagetable_init( ) depend on both the
amount of RAM present and on the CPU model. Let's start with the
simplest case. Our computer has less than 896 MB[*] of RAM, 32-bit physical addresses are sufficient to
address all the available RAM, and there is no need to activate the
PAE mechanism. (See the earlier section "The Physical Address Extension
(PAE) Paging Mechanism.")
The swapper_pg_dir Page
Global Directory is reinitialized by a cycle equivalent to the
following:
pgd = swapper_pg_dir + pgd_index(PAGE_OFFSET); /* 768 */
phys_addr = 0x00000000;
while (phys_addr < (max_low_pfn * PAGE_SIZE)) {
pmd = one_md_table_init(pgd); /* returns pgd itself */
set_pmd(pmd, _ _pmd(phys_addr | pgprot_val(_ _pgprot(0x1e3))));
/* 0x1e3 == Present, Accessed, Dirty, Read/Write,
Page Size, Global */
phys_addr += PTRS_PER_PTE * PAGE_SIZE; /* 0x400000 */
++pgd;
}
We assume that the CPU is a recent 80 × 86 microprocessor
supporting 4 MB pages and "global" TLB entries. Notice that the
User/Supervisor flags in all Page
Global Directory entries referencing linear addresses above 0xc0000000 are cleared, thus denying
processes in User Mode access to the kernel address space. Notice
also that the Page Size flag is
set so that the kernel can address the RAM by making use of large
pages (see the section "Extended Paging" earlier
in this chapter).
The identity mapping of the first megabytes of physical memory
(8 MB in our example) built by the startup_32( ) function is required to
complete the initialization phase of the kernel. When this mapping
is no longer necessary, the kernel clears the corresponding page
table entries by invoking the zap_low_mappings( ) function.
Actually, this description does not state the whole truth. As we'll see in the later section "Fix-Mapped Linear Addresses," the kernel also adjusts the entries of Page Tables corresponding to the "fix-mapped linear addresses ."
In this case, the RAM cannot be mapped entirely into the kernel linear address space. The best Linux can do during the initialization phase is to map a RAM window of size 896 MB into the kernel linear address space. If a program needs to address other parts of the existing RAM, some other linear address interval must be mapped to the required RAM. This implies changing the value of some page table entries. We'll discuss how this kind of dynamic remapping is done in Chapter 8.
To initialize the Page Global Directory, the kernel uses the same code as in the previous case.
Let's now consider kernel Page Table initialization for computers with more than 4 GB of RAM; more precisely, we deal with cases in which the following happens: the CPU model supports Physical Address Extension (PAE), the amount of RAM is larger than 4 GB, and the kernel is compiled with PAE support.
Although PAE handles 36-bit physical addresses, linear addresses are still 32-bit addresses. As in the previous case, Linux maps a 896-MB RAM window into the kernel linear address space; the remaining RAM is left unmapped and handled by dynamic remapping, as described in Chapter 8. The main difference with the previous case is that a three-level paging model is used, so the Page Global Directory is initialized by a cycle equivalent to the following:
pgd_idx = pgd_index(PAGE_OFFSET); /* 3 */
for (i=0; i<pgd_idx; i++)
set_pgd(swapper_pg_dir + i, _ _pgd(_ _pa(empty_zero_page) + 0x001));
/* 0x001 == Present */
pgd = swapper_pg_dir + pgd_idx;
phys_addr = 0x00000000;
for (; i<PTRS_PER_PGD; ++i, ++pgd) {
pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
set_pgd(pgd, _ _pgd(_ _pa(pmd) | 0x001)); /* 0x001 == Present */
if (phys_addr < max_low_pfn * PAGE_SIZE)
for (j=0; j < PTRS_PER_PMD /* 512 */
&& phys_addr < max_low_pfn*PAGE_SIZE; ++j) {
set_pmd(pmd, _ _pmd(phys_addr |
pgprot_val(_ _pgprot(0x1e3))));
/* 0x1e3 == Present, Accessed, Dirty, Read/Write,
Page Size, Global */
phys_addr += PTRS_PER_PTE * PAGE_SIZE; /* 0x200000 */
}
}
swapper_pg_dir[0] = swapper_pg_dir[pgd_idx];
The kernel initializes the first three entries in the Page
Global Directory corresponding to the user linear address space with
the address of an empty page (empty_zero_page). The fourth entry is
initialized with the address of a Page Middle Directory (pmd) allocated by invoking alloc_bootmem_low_pages( ). The first 448
entries in the Page Middle Directory (there are 512 entries, but the
last 64 are reserved for noncontiguous memory allocation; see the
section "Noncontiguous
Memory Area Management" in Chapter 8) are filled with the
physical address of the first 896 MB of RAM.
Notice that all CPU models that support PAE also support large 2-MB pages and global pages. As in the previous cases, whenever possible, Linux uses large pages to reduce the number of Page Tables.
The fourth Page Global Directory entry is then copied into the
first entry, so as to mirror the mapping of the low physical memory
in the first 896 MB of the linear address space. This mapping is
required in order to complete the initialization of SMP systems: when it is no longer necessary, the kernel
clears the corresponding page table entries by invoking the zap_low_mappings( ) function, as in the
previous cases.
We saw that the initial part of the fourth gigabyte of kernel linear addresses maps the physical memory of the system. However, at least 128 MB of linear addresses are always left available because the kernel uses them to implement noncontiguous memory allocation and fix-mapped linear addresses.
Noncontiguous memory allocation is just a special way to dynamically allocate and release pages of memory, and is described in the section "Linear Addresses of Noncontiguous Memory Areas" in Chapter 8. In this section, we focus on fix-mapped linear addresses.
Basically, a fix-mapped linear address is a
constant linear address like 0xffffc000 whose corresponding physical
address does not have to be the linear address minus 0xc0000000, but rather a physical address set
in an arbitrary way. Thus, each fix-mapped linear address maps one
page frame of the physical memory. As we'll see in later chapters, the
kernel uses fix-mapped linear addresses instead of pointer variables
that never change their value.
Fix-mapped linear addresses are conceptually similar to the
linear addresses that map the first 896 MB of RAM. However, a
fix-mapped linear address can map any physical address, while the
mapping established by the linear addresses in the initial portion of
the fourth gigabyte is linear (linear address X
maps physical address X-PAGE_OFFSET).
With respect to variable pointers, fix-mapped linear addresses are more efficient. In fact, dereferencing a variable pointer requires one memory access more than dereferencing an immediate constant address. Moreover, checking the value of a variable pointer before dereferencing it is a good programming practice; conversely, the check is never required for a constant linear address.
Each fix-mapped linear address is represented by a small integer
index defined in the enum
fixed_addresses data structure:
enum fixed_addresses {
FIX_HOLE,
FIX_VSYSCALL,
FIX_APIC_BASE,
FIX_IO_APIC_BASE_0,
[...]
_ _end_of_fixed_addresses
};
Fix-mapped linear addresses are placed at the end of the fourth
gigabyte of linear addresses. The fix_to_virt( ) function computes the
constant linear address starting from the index:
inline unsigned long fix_to_virt(const unsigned int idx)
{
if (idx >= _ _end_of_fixed_addresses)
_ _this_fixmap_does_not_exist( );
return (0xfffff000UL - (idx << PAGE_SHIFT));
}
Let's assume that some kernel function invokes fix_to_virt(FIX_IOAPIC_BASE_0). Because the
function is declared as "inline," the C compiler does not generate a
call to fix_to_virt( ), but inserts
its code in the calling function. Moreover, the check on the index
value is never performed at runtime. In fact, FIX_IOAPIC_BASE_0 is a constant equal to 3,
so the compiler can cut away the if
statement because its condition is false at compile time. Conversely,
if the condition is true or the argument of fix_to_virt( ) is not a constant, the
compiler issues an error during the linking phase because the symbol
__this_fixmap_does_not_exist is
not defined anywhere. Eventually, the compiler computes 0xfffff000-(3<<PAGE_SHIFT) and
replaces the fix_to_virt( )
function call with the constant linear address 0xffffc000.
To associate a physical address with a fix-mapped linear
address, the kernel uses the set_fixmap(idx,phys) and set_fixmap_nocache(idx,phys) macros. Both of
them initialize the Page Table entry corresponding to the fix_to_virt(idx) linear address with the
physical address phys; however, the
second function also sets the PCD
flag of the Page Table entry, thus disabling the hardware cache when
accessing the data in the page frame (see the section "Hardware Cache" earlier
in this chapter). Conversely, clear_fixmap(idx) removes the linking
between a fix-mapped linear address idx and the physical address.
The last topic of memory addressing deals with how the kernel makes an optimal use of the hardware caches. Hardware caches and Translation Lookaside Buffers play a crucial role in boosting the performance of modern computer architectures. Several techniques are used by kernel developers to reduce the number of cache and TLB misses.
As mentioned earlier in this chapter, hardware caches are
addressed by cache lines. The L1_CACHE_BYTES macro yields the size of a
cache line in bytes. On Intel models earlier than the Pentium 4, the
macro yields the value 32; on a Pentium 4, it yields the value
128.
To optimize the cache hit rate, the kernel considers the architecture in making the following decisions.
The most frequently used fields of a data structure are placed at the low offset within the data structure, so they can be cached in the same line.
When allocating a large set of data structures, the kernel tries to store each of them in memory in such a way that all cache lines are used uniformly.
Cache synchronization is performed automatically by the 80 × 86 microprocessors, thus the Linux kernel for this kind of processor does not perform any hardware cache flushing. The kernel does provide, however, cache flushing interfaces for processors that do not synchronize caches.
Processors cannot synchronize their own TLB cache automatically because it is the kernel, and not the hardware, that decides when a mapping between a linear and a physical address is no longer valid.
Linux 2.6 offers several TLB flush methods that should be applied appropriately, depending on the type of page table change (see Table 2-11).
Table 2-11. Architecture-independent TLB-invalidating methods
| Method name | Description | Typically used when |
|---|---|---|
| flush_tlb_all | Flushes all TLB entries (including those that refer to global pages, that is, pages whose Global flag is set) | Changing the kernel page table entries |
| flush_tlb_kernel_range | Flushes all TLB entries in a given range of linear addresses (including those that refer to global pages) | Changing a range of kernel page table entries |
| flush_tlb | Flushes all TLB entries of the non-global pages owned by the current process | Performing a process switch |
| flush_tlb_mm | Flushes all TLB entries of the non-global pages owned by a given process | Forking a new process |
| flush_tlb_range | Flushes the TLB entries corresponding to a linear address interval of a given process | Releasing a linear address interval of a process |
| flush_tlb_pgtables | Flushes the TLB entries of a given contiguous subset of page tables of a given process | Releasing some page tables of a process |
| flush_tlb_page | Flushes the TLB of a single Page Table entry of a given process | Processing a Page Fault |
Despite the rich set of TLB methods offered by the generic Linux kernel, every microprocessor usually offers a far more restricted set of TLB-invalidating assembly language instructions. In this respect, one of the more flexible hardware platforms is Sun's UltraSPARC. In contrast, Intel microprocessors offer only two TLB-invalidating techniques:
All Pentium models automatically flush the TLB entries
relative to non-global pages when a value is loaded into the
cr3 register.
In Pentium Pro and later models, the invlpg assembly language instruction invalidates a
single TLB entry mapping a given linear address.
Table 2-12 lists the Linux macros that exploit such hardware techniques; these macros are the basic ingredients to implement the architecture-independent methods listed in Table 2-11.
Table 2-12. TLB-invalidating macros for the Intel Pentium Pro and later processors
| Macro name | Description | Used by |
|---|---|---|
| __flush_tlb( ) | Rewrites the cr3 register back into itself | flush_tlb, flush_tlb_mm, flush_tlb_range |
| __flush_tlb_global( ) | Disables global pages by clearing the PGE flag of cr4, rewrites the cr3 register back into itself, and sets again the PGE flag | flush_tlb_all, flush_tlb_kernel_range |
| __flush_tlb_single(addr) | Executes the invlpg assembly language instruction with parameter addr | flush_tlb_page |
Notice that the flush_tlb_pgtables method is missing from
Table 2-12: in
the 80 × 86 architecture nothing has to be done when a page table is
unlinked from its parent table, thus the function implementing this
method is empty.
The architecture-independent TLB-invalidating methods are extended quite simply to multiprocessor systems. The function running on a CPU sends an Interprocessor Interrupt (see "Interprocessor Interrupt Handling" in Chapter 4) to the other CPUs that forces them to execute the proper TLB-invalidating function.
As a general rule, any process switch implies changing the set
of active page tables. Local TLB entries relative to the old page
tables must be flushed; this is done automatically when the kernel
writes the address of the new Page Global Directory into the
cr3 control register. The kernel
succeeds, however, in avoiding TLB flushes in the following
cases:
When performing a process switch between two regular processes that use the same set of page tables (see the section "The schedule( ) Function" in Chapter 7).
When performing a process switch between a regular process and a kernel thread. In fact, we'll see in the section "Memory Descriptor of Kernel Threads" in Chapter 9, that kernel threads do not have their own set of page tables; rather, they use the set of page tables owned by the regular process that was scheduled last for execution on the CPU.
Besides process switches, there are other cases in which the kernel needs to flush some entries in a TLB. For instance, when the kernel assigns a page frame to a User Mode process and stores its physical address into a Page Table entry, it must flush any local TLB entry that refers to the corresponding linear address. On multiprocessor systems, the kernel also must flush the same TLB entry on the CPUs that are using the same set of page tables, if any.
To avoid useless TLB flushing in multiprocessor systems, the kernel uses a technique called lazy TLB mode . The basic idea is the following: if several CPUs are using the same page tables and a TLB entry must be flushed on all of them, then TLB flushing may, in some cases, be delayed on CPUs running kernel threads.
In fact, each kernel thread does not have its own set of page tables; rather, it makes use of the set of page tables belonging to a regular process. However, there is no need to invalidate a TLB entry that refers to a User Mode linear address, because no kernel thread accesses the User Mode address space.[*]
When some CPUs start running a kernel thread, the kernel sets it into lazy TLB mode. When requests are issued to clear some TLB entries, each CPU in lazy TLB mode does not flush the corresponding entries; however, the CPU remembers that its current process is running on a set of page tables whose TLB entries for the User Mode addresses are invalid. As soon as the CPU in lazy TLB mode switches to a regular process with a different set of page tables, the hardware automatically flushes the TLB entries, and the kernel sets the CPU back in non-lazy TLB mode. However, if a CPU in lazy TLB mode switches to a regular process that owns the same set of page tables used by the previously running kernel thread, then any deferred TLB invalidation must be effectively applied by the kernel. This "lazy" invalidation is effectively achieved by flushing all non-global TLB entries of the CPU.
Some extra data structures are needed to implement the lazy
TLB mode. The cpu_tlbstate
variable is a static array of NR_CPUS structures (the default value for
this macro is 32; it denotes the maximum number of CPUs in the
system) consisting of an active_mm field pointing to the memory
descriptor of the current process (see Chapter 9) and a state flag that can assume only two
values: TLBSTATE_OK (non-lazy TLB
mode) or TLBSTATE_LAZY (lazy TLB
mode). Furthermore, each memory descriptor includes a cpu_vm_mask field that stores the indices
of the CPUs that should receive Interprocessor Interrupts related to
TLB flushing. This field is meaningful only when the memory
descriptor belongs to a process currently in execution.
When a CPU starts executing a kernel thread, the kernel sets
the state field of its cpu_tlbstate element to TLBSTATE_LAZY; moreover, the cpu_vm_mask field of the active memory
descriptor stores the indices of all CPUs in the system, including
the one that is entering in lazy TLB mode. When another CPU wants to
invalidate the TLB entries of all CPUs relative to a given set of
page tables, it delivers an Interprocessor Interrupt to all CPUs
whose indices are included in the cpu_vm_mask field of the corresponding
memory descriptor.
When a CPU receives an Interprocessor Interrupt related to TLB
flushing and verifies that it affects the set of page tables of its
current process, it checks whether the state field of its cpu_tlbstate element is equal to TLBSTATE_LAZY. In this case, the kernel
refuses to invalidate the TLB entries and removes the CPU index from
the cpu_vm_mask field of the
memory descriptor. This has two consequences:
As long as the CPU remains in lazy TLB mode, it will not receive other Interprocessor Interrupts related to TLB flushing.
If the CPU switches to another process that is using the same set of page tables as the kernel thread that is being replaced, the kernel invokes __flush_tlb( ) to invalidate all non-global TLB entries of the CPU.
[*] This change has been made to fully support the linear address bit splitting used by the x86_64 platform (see Table 2-4).
[*] You can find the linear address of these symbols in the file System.map, which is created right after the kernel is compiled.
[*] The highest 128 MB of linear addresses are left available for several kinds of mappings (see sections "Fix-Mapped Linear Addresses" later in this chapter and "Linear Addresses of Noncontiguous Memory Areas" in Chapter 8). The kernel address space left for mapping the RAM is thus 1 GB − 128 MB = 896 MB.
The concept of a process is fundamental to any multiprogramming operating system. A process is usually defined as an instance of a program in execution; thus, if 16 users are running vi at once, there are 16 separate processes (although they can share the same executable code). Processes are often called tasks or threads in the Linux source code.
In this chapter, we discuss static properties of processes and then describe how process switching is performed by the kernel. The last two sections describe how processes can be created and destroyed. We also describe how Linux supports multithreaded applications — as mentioned in Chapter 1, it relies on so-called lightweight processes (LWP).
The term "process" is often used with several different meanings. In this book, we stick to the usual OS textbook definition: a process is an instance of a program in execution. You might think of it as the collection of data structures that fully describes how far the execution of the program has progressed.
Processes are like human beings: they are generated, they have a more or less significant life, they optionally generate one or more child processes, and eventually they die. A small difference is that sex is not really common among processes — each process has just one parent.
From the kernel's point of view, the purpose of a process is to act as an entity to which system resources (CPU time, memory, etc.) are allocated.
When a process is created, it is almost identical to its parent. It receives a (logical) copy of the parent's address space and executes the same code as the parent, beginning at the next instruction following the process creation system call. Although the parent and child may share the pages containing the program code (text), they have separate copies of the data (stack and heap), so that changes by the child to a memory location are invisible to the parent (and vice versa).
While earlier Unix kernels employed this simple model, modern Unix systems do not. They support multithreaded applications — user programs having many relatively independent execution flows sharing a large portion of the application data structures. In such systems, a process is composed of several user threads (or simply threads), each of which represents an execution flow of the process. Nowadays, most multithreaded applications are written using standard sets of library functions called pthread (POSIX thread) libraries .
Older versions of the Linux kernel offered no support for multithreaded applications. From the kernel point of view, a multithreaded application was just a normal process. The multiple execution flows of a multithreaded application were created, handled, and scheduled entirely in User Mode, usually by means of a POSIX-compliant pthread library.
However, such an implementation of multithreaded applications is not very satisfactory. For instance, suppose a chess program uses two threads: one of them controls the graphical chessboard, waiting for the moves of the human player and showing the moves of the computer, while the other thread ponders the next move of the game. While the first thread waits for the human move, the second thread should run continuously, thus exploiting the thinking time of the human player. However, if the chess program is just a single process, the first thread cannot simply issue a blocking system call waiting for a user action; otherwise, the second thread is blocked as well. Instead, the first thread must employ sophisticated nonblocking techniques to ensure that the process remains runnable.
Linux uses lightweight processes to offer better support for multithreaded applications. Basically, two lightweight processes may share some resources, like the address space, the open files, and so on. Whenever one of them modifies a shared resource, the other immediately sees the change. Of course, the two processes must synchronize themselves when accessing the shared resource.
A straightforward way to implement multithreaded applications is to associate a lightweight process with each thread. In this way, the threads can access the same set of application data structures by simply sharing the same memory address space, the same set of open files, and so on; at the same time, each thread can be scheduled independently by the kernel so that one may sleep while another remains runnable. Examples of POSIX-compliant pthread libraries that use Linux's lightweight processes are LinuxThreads, Native POSIX Thread Library (NPTL), and IBM's Next Generation Posix Threading Package (NGPT).
POSIX-compliant multithreaded applications are best handled by kernels that support "thread groups." In Linux, a thread group is basically a set of lightweight processes that implement a multithreaded application and act as a whole with regard to some system calls such as getpid( ), kill( ), and _exit( ). We are going to describe them at length later in this chapter.
To manage processes, the kernel must have a clear picture
of what each process is doing. It must know, for instance, the process's
priority, whether it is running on a CPU or blocked on an event, what
address space has been assigned to it, which files it is allowed to
address, and so on. This is the role of the process
descriptor — a task_struct
type structure whose fields contain all the information related to a
single process.[*] As the repository of so much information, the process
descriptor is rather complex. In addition to a large number of fields
containing process attributes, the process descriptor contains several
pointers to other data structures that, in turn, contain pointers to
other structures. Figure
3-1 describes the Linux process descriptor schematically.
The six data structures on the right side of the figure refer to specific resources owned by the process. Most of these resources will be covered in future chapters. This chapter focuses on two types of fields that refer to the process state and to process parent/child relationships.
As its name implies, the state field of the process descriptor
describes what is currently happening to the process. It consists of
an array of flags, each of which describes a possible process state.
In the current Linux version, these states are mutually exclusive, and
hence exactly one flag of state
always is set; the remaining flags are cleared. The following are the
possible process states:
TASK_RUNNING
The process is either executing on a CPU or waiting to be executed.
TASK_INTERRUPTIBLE
The process is suspended (sleeping) until some condition
becomes true. Raising a hardware interrupt, releasing a system
resource the process is waiting for, or delivering a signal are examples of conditions that might
wake up the process (put its state back to TASK_RUNNING).
TASK_UNINTERRUPTIBLE
Like TASK_INTERRUPTIBLE, except that
delivering a signal to the sleeping process leaves its state
unchanged. This process state is seldom used. It is valuable,
however, under certain specific conditions in which a process
must wait until a given event occurs without being interrupted.
For instance, this state may be used when a process opens a
device file and the corresponding device driver starts probing
for a corresponding hardware device. The device driver must not
be interrupted until the probing is complete, or the hardware
device could be left in an unpredictable state.
TASK_STOPPED
Process execution has been stopped; the process enters
this state after receiving a SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal.
TASK_TRACED
Process execution has been stopped by a debugger. When a
process is being monitored by another (such as when a debugger
executes a ptrace( ) system
call to monitor a test program), each signal may put the process
in the TASK_TRACED
state.
Two additional states of the process can be stored both in the
state field and in the exit_state field of the process descriptor;
as the field name suggests, a process reaches one of these two states
only when its execution is terminated:
EXIT_ZOMBIE
Process execution is terminated, but the parent process
has not yet issued a wait4( )
or waitpid( )
system call to return information about the dead
process.[*] Before the wait(
)-like call is issued, the kernel cannot discard the
data contained in the dead process descriptor because the parent
might need it. (See the section "Process Removal"
near the end of this chapter.)
EXIT_DEAD
The final state: the process is being removed by the
system because the parent process has just issued a wait4( ) or waitpid( ) system call for it.
Changing its state from EXIT_ZOMBIE to EXIT_DEAD avoids race conditions due
to other threads of execution that execute wait( )-like calls on the same process
(see Chapter
5).
The value of the state field
is usually set with a simple assignment. For instance:
p->state = TASK_RUNNING;
The kernel also uses the set_task_state and set_current_state macros: they set the state
of a specified process and of the process currently executed,
respectively. Moreover, these macros ensure that the assignment
operation is not mixed with other instructions by the compiler or the
CPU control unit. Mixing the instruction order may sometimes lead to
catastrophic results (see Chapter
5).
As a general rule, each execution context that can be
independently scheduled must have its own process descriptor;
therefore, even lightweight processes, which share a large portion of
their kernel data structures, have their own task_struct structures.
The strict one-to-one correspondence between the process and
process descriptor makes the 32-bit address[†] of the task_struct
structure a useful means for the kernel to identify processes. These
addresses are referred to as process descriptor
pointers. Most of the references to processes that the
kernel makes are through process descriptor pointers.
On the other hand, Unix-like operating systems allow users to
identify processes by means of a number called the Process
ID (or PID), which is stored in the
pid field of the process
descriptor. PIDs are numbered sequentially: the PID of a newly created
process is normally the PID of the previously created process
increased by one. Of course, there is an upper limit on the PID
values; when the kernel reaches such limit, it must start recycling
the lower, unused PIDs. By default, the maximum PID number is 32,767
(PID_MAX_DEFAULT - 1); the system
administrator may reduce this limit by writing a smaller value into
the /proc /sys/kernel/pid_max file (/proc is the mount point of a special
filesystem, see the section "Special Filesystems" in
Chapter 12). In 64-bit
architectures, the system administrator can enlarge the maximum PID
number up to 4,194,303.
When recycling PID numbers, the kernel must manage a pidmap_array bitmap that denotes which are
the PIDs currently assigned and which are the free ones. Because a
page frame contains 32,768 bits, in 32-bit architectures the pidmap_array bitmap is stored in a single
page. In 64-bit architectures, however, additional pages can be added
to the bitmap when the kernel assigns a PID number too large for the
current bitmap size. These pages are never released.
Linux associates a different PID with each process or lightweight process in the system. (As we shall see later in this chapter, there is a tiny exception on multiprocessor systems.) This approach allows the maximum flexibility, because every execution context in the system can be uniquely identified.
On the other hand, Unix programmers expect threads in the same group to have a common PID. For instance, it should be possible to send a signal specifying a PID that affects all threads in the group. In fact, the POSIX 1003.1c standard states that all threads of a multithreaded application must have the same PID.
To comply with this standard, Linux makes use of thread groups.
The identifier shared by the threads is the PID of the thread group
leader , that is, the PID of the first lightweight process in
the group; it is stored in the tgid
field of the process descriptors. The getpid(
) system call returns the value of tgid relative to the current process instead
of the value of pid, so all the
threads of a multithreaded application share the same identifier. Most
processes belong to a thread group consisting of a single member; as
thread group leaders, they have the tgid field equal to the pid field, thus the getpid( ) system call works as usual for
this kind of process.
Later, we'll show you how it is possible to derive a true
process descriptor pointer efficiently from its respective PID.
Efficiency is important because many system calls such as kill( ) use the PID to denote the affected
process.
Processes are dynamic entities whose lifetimes range
from a few milliseconds to months. Thus, the kernel must be able to
handle many processes at the same time, and process descriptors are
stored in dynamic memory rather than in the memory area permanently assigned
to the kernel. For each process, Linux packs two different data
structures in a single per-process memory area: a small data
structure linked to the process descriptor, namely the thread_info structure, and the Kernel Mode
process stack. The length of this memory area is usually 8,192 bytes
(two page frames). For reasons of efficiency the kernel stores the
8-KB memory area in two consecutive page frames with the first page
frame aligned to a multiple of 2^13; this
may turn out to be a problem when little dynamic memory is
available, because the free memory may become highly fragmented (see
the section "The Buddy
System Algorithm" in Chapter 8). Therefore, in the
80×86 architecture the kernel can be configured at compilation time
so that the memory area including stack and thread_info structure spans a single page
frame (4,096 bytes).
In the section "Segmentation in Linux" in
Chapter 2, we learned that
a process in Kernel Mode accesses a stack contained in the kernel
data segment, which is different from the stack used by the process
in User Mode. Because kernel control paths make little use of the stack, only a few thousand
bytes of kernel stack are required. Therefore, 8 KB is ample space
for the stack and the thread_info
structure. However, when stack and thread_info structure are contained in a
single page frame, the kernel uses a few additional stacks to avoid
the overflows caused by deeply nested interrupts and exceptions (see
Chapter 4).
Figure 3-2
shows how the two data structures are stored in the 2-page (8 KB)
memory area. The thread_info
structure resides at the beginning of the memory area, and the stack
grows downward from the end. The figure also shows that the thread_info structure and the task_struct structure are mutually linked
by means of the fields task and
thread_info, respectively.
Figure 3-2. Storing the thread_info structure and the process kernel stack in two page frames
The esp register is the CPU
stack pointer, which is used to address the stack's top location. On
80×86 systems, the stack starts at the end and grows toward the
beginning of the memory area. Right after switching from User Mode
to Kernel Mode, the kernel stack of a process is always empty, and
therefore the esp register points
to the byte immediately following the stack.
The value of the esp is
decreased as soon as data is written into the stack. Because the
thread_info structure is 52 bytes
long, the kernel stack can expand up to 8,140 bytes.
The C language allows the thread_info structure and the kernel stack
of a process to be conveniently represented by means of the
following union construct:
union thread_union {
struct thread_info thread_info;
unsigned long stack[2048]; /* 1024 for 4KB stacks */
};
The thread_info structure
shown in Figure 3-2
is stored starting at address 0x015fa000, and the stack is stored
starting at address 0x015fc000.
The value of the esp register
points to the current top of the stack at 0x015fa878.
The kernel uses the alloc_thread_info and free_thread_info macros to allocate and
release the memory area storing a thread_info structure and a kernel
stack.
The close association between the thread_info structure and the Kernel Mode
stack just described offers a key benefit in terms of efficiency:
the kernel can easily obtain the address of the thread_info structure of the process
currently running on a CPU from the value of the esp register. In fact, if the thread_union structure is 8 KB
(2^13 bytes) long, the kernel masks out
the 13 least significant bits of esp to obtain the base address of the
thread_info structure; on the
other hand, if the thread_union
structure is 4 KB long, the kernel masks out the 12 least
significant bits of esp. This is
done by the current_thread_info(
) function, which produces assembly language instructions
like the following:
movl $0xffffe000,%ecx /* or 0xfffff000 for 4KB stacks */
andl %esp,%ecx
movl %ecx,p
After executing these three instructions, p contains the thread_info structure pointer of the
process running on the CPU that executes the instruction.
Most often the kernel needs the address of the process
descriptor rather than the address of the thread_info structure. To get the process
descriptor pointer of the process currently running on a CPU, the
kernel makes use of the current
macro, which is essentially equivalent to current_thread_info( )->task and
produces assembly language instructions like the following:
movl $0xffffe000,%ecx /* or 0xfffff000 for 4KB stacks */
andl %esp,%ecx
movl (%ecx),p
Because the task field is
at offset 0 in the thread_info
structure, after executing these three instructions p contains the process descriptor pointer
of the process running on the CPU.
The current macro often
appears in kernel code as a prefix to fields of the process
descriptor. For example, current->pid returns the process ID of
the process currently running on the CPU.
Another advantage of storing the process descriptor with the
stack emerges on multiprocessor systems: the correct current process
for each hardware processor can be derived just by checking the
stack, as shown previously. Earlier versions of Linux did not store
the kernel stack and the process descriptor together. Instead, they
were forced to introduce a global static variable called current to identify the process descriptor
of the running process. On multiprocessor systems, it was necessary
to define current as an array—one
element for each available CPU.
Before moving on and describing how the kernel keeps track of the various processes in the system, we would like to emphasize the role of special data structures that implement doubly linked lists.
For each list, a set of primitive operations must be implemented: initializing the list, inserting and deleting an element, scanning the list, and so on. It would be both a waste of programmers' efforts and a waste of memory to replicate the primitive operations for each different list.
Therefore, the Linux kernel defines the list_head data structure, whose only
fields next and prev represent the forward and back
pointers of a generic doubly linked list element, respectively. It
is important to note, however, that the pointers in a list_head field store the addresses of
other list_head fields rather
than the addresses of the whole data structures in which the
list_head structure is included;
see Figure 3-3
(a).
A new list is created by using the LIST_HEAD(list_name) macro. It declares a
new variable named list_name of
type list_head, which is a dummy
first element that acts as a placeholder for the head of the new
list, and initializes the prev
and next fields of the list_head data structure so as to point to
the list_name variable itself;
see Figure 3-3
(b).
Several functions and macros implement the primitives, including those shown in Table 3-1.
Table 3-1. List handling functions and macros
| Name | Description |
|---|---|
| list_add(n,p) | Inserts an element pointed to by n right after the specified element pointed to by p |
| list_add_tail(n,p) | Inserts an element pointed to by n right before the specified element pointed to by p |
| list_del(p) | Deletes an element pointed to by p |
| list_empty(p) | Checks if the list specified by the address p of its head is empty |
| list_entry(p,t,m) | Returns the address of the data structure of type t in which the list_head field that has the name m and the address p is included |
| list_for_each(p,h) | Scans the elements of the list specified by the address h of the list head; in each iteration, a pointer to the list_head structure of the list element is returned in p |
| list_for_each_entry(p,h,m) | Similar to list_for_each, but returns the address of the data structure embedding the list_head structure rather than the address of the list_head structure itself |
The Linux kernel 2.6 sports another kind of doubly linked
list, which mainly differs from a list_head list because it is not circular;
it is mainly used for hash tables, where space is important, and
finding the last element in constant time is not. The list head
is stored in an hlist_head data
structure, which is simply a pointer to the first element in the
list (NULL if the list is empty).
Each element is represented by an hlist_node data structure, which includes
a pointer next to the next
element, and a pointer pprev to
the next field of the previous
element. Because the list is not circular, the pprev field of the first element and the
next field of the last element
are set to NULL. The list can be
handled by means of several helper functions and macros similar to
those listed in Table
3-1: hlist_add_head( ),
hlist_del( ), hlist_empty( ), hlist_entry, hlist_for_each_entry, and so on.
The first example of a doubly linked list we will
examine is the process list, a list that links
together all existing process descriptors. Each task_struct structure includes a tasks field of type list_head whose prev and next fields point, respectively, to the
previous and to the next task_struct element.
The head of the process list is the init_task task_struct descriptor; it is
the process descriptor of the so-called process
0 or swapper (see the section "Kernel Threads" later
in this chapter). The tasks->prev field of init_task points to the tasks field of the process descriptor
inserted last in the list.
The SET_LINKS and REMOVE_LINKS macros are used to insert and
to remove a process descriptor in the process list, respectively.
These macros also take care of the parenthood relationship of the
process (see the section "How Processes Are
Organized" later in this chapter).
Another useful macro, called for_each_process, scans the whole process
list. It is defined as:
#define for_each_process(p) \
for (p=&init_task; (p=list_entry((p)->tasks.next, \
struct task_struct, tasks) \
) != &init_task; )
The macro is the loop control statement after which the kernel
programmer supplies the loop. Notice how the init_task process descriptor just plays
the role of list header. The macro starts by moving past init_task to the next task and continues
until it reaches init_task again
(thanks to the circularity of the list). At each iteration, the
variable passed as the argument of the macro contains the address of
the currently scanned process descriptor, as returned by the
list_entry macro.
When looking for a new process to run on a CPU, the kernel has
to consider only the runnable processes (that is, the processes in
the TASK_RUNNING state).
Earlier Linux versions put all runnable processes in the same list called runqueue. Because it would be too costly to maintain the list ordered according to process priorities, the earlier schedulers were compelled to scan the whole list in order to select the "best" runnable process.
Linux 2.6 implements the runqueue differently. The aim is to allow the scheduler to select the best runnable process in constant time, independently of the number of runnable processes. We'll defer to Chapter 7 a detailed description of this new kind of runqueue, and we'll provide here only some basic information.
The trick used to achieve the scheduler speedup consists of
splitting the runqueue in many lists of runnable processes, one list
per process priority. Each task_struct descriptor includes a run_list field of type list_head. If the process priority is
equal to k (a value ranging between 0 and 139), the run_list field links the process
descriptor into the list of runnable processes having priority k.
Furthermore, on a multiprocessor system, each CPU has its own
runqueue, that is, its own set of lists of processes. This is a
classic example of making a data structure more complex to improve
performance: to make scheduler operations more efficient, the
runqueue list has been split into 140 different lists!
As we'll see, the kernel must preserve a lot of data for every
runqueue in the system; however, the main data structures of a
runqueue are the lists of process descriptors belonging to the
runqueue; all these lists are implemented by a single prio_array_t data structure, whose fields
are shown in Table
3-2.
Table 3-2. The fields of the prio_array_t data structure
| Type | Field | Description |
|---|---|---|
| int | nr_active | The number of process descriptors linked into the lists |
| unsigned long [5] | bitmap | A priority bitmap: each flag is set if and only if the corresponding priority list is not empty |
| struct list_head [140] | queue | The 140 heads of the priority lists |
The enqueue_task(p,array)
function inserts a process descriptor into a runqueue list; its code
is essentially equivalent to:
list_add_tail(&p->run_list, &array->queue[p->prio]);
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
The prio field of the
process descriptor stores the dynamic priority of the process, while
the array field is a pointer to
the prio_array_t data structure
of its current runqueue. Similarly, the dequeue_task(p,array) function removes a
process descriptor from a runqueue list.
Processes created by a program have a parent/child relationship. When a process creates multiple children, these children have sibling relationships. Several fields must be introduced in a process descriptor to represent these relationships; they are listed in Table 3-3 with respect to a given process P. Processes 0 and 1 are created by the kernel; as we'll see later in the chapter, process 1 (init) is the ancestor of all other processes.
Table 3-3. Fields of a process descriptor used to express parenthood relationships
| Field name | Description |
|---|---|
| real_parent | Points to the process descriptor of the process that created P or to the descriptor of process 1 (init) if the parent process no longer exists. (Therefore, when a user starts a background process and exits the shell, the background process becomes the child of init.) |
| parent | Points to the current parent of P (this is the process that must be signaled when the child process terminates); its value usually coincides with that of real_parent |
| children | The head of the list containing all children created by P. |
| sibling | The pointers to the next and previous elements in the list of the sibling processes, those that have the same parent as P. |
Figure 3-4 illustrates the parent and sibling relationships of a group of processes. Process P0 successively created P1, P2, and P3. Process P3, in turn, created process P4.
Furthermore, there exist other relationships among processes: a process can be a leader of a process group or of a login session (see "Process Management" in Chapter 1), it can be a leader of a thread group (see "Identifying a Process" earlier in this chapter), and it can also trace the execution of other processes (see the section "Execution Tracing" in Chapter 20). Table 3-4 lists the fields of the process descriptor that establish these relationships between a process P and the other processes.
Table 3-4. The fields of the process descriptor that establish non-parenthood relationships
| Field name | Description |
|---|---|
| group_leader | Process descriptor pointer of the group leader of P |
| signal->pgrp | PID of the group leader of P |
| tgid | PID of the thread group leader of P |
| signal->session | PID of the login session leader of P |
| ptrace_children | The head of a list containing all children of P being traced by a debugger |
| ptrace_list | The pointers to the next and previous elements in the real parent's list of traced processes (used when P is being traced) |
In several circumstances, the kernel must be able to
derive the process descriptor pointer corresponding to a PID. This
occurs, for instance, in servicing the kill( ) system call. When process P1
wishes to send a signal to another process, P2, it invokes the
kill( ) system call specifying
the PID of P2 as the parameter. The kernel derives the process
descriptor pointer from the PID and then extracts the pointer to the
data structure that records the pending signals from P2's process descriptor.
Scanning the process list sequentially and checking the
pid fields of the process
descriptors is feasible but rather inefficient. To speed up the
search, four hash tables have been introduced. Why multiple hash
tables? Simply because the process descriptor includes fields that
represent different types of PID (see Table 3-5), and each
type of PID requires its own hash table.
Table 3-5. The four hash tables and corresponding fields in the process descriptor
| Hash table type | Field name | Description |
|---|---|---|
| PIDTYPE_PID | pid | PID of the process |
| PIDTYPE_TGID | tgid | PID of the thread group leader process |
| PIDTYPE_PGID | pgrp | PID of the group leader process |
| PIDTYPE_SID | session | PID of the session leader process |
The four hash tables are dynamically allocated during the
kernel initialization phase, and their addresses are stored in the
pid_hash array. The size of a
single hash table depends on the amount of available RAM; for
example, for systems having 512 MB of RAM, each hash table is stored
in four page frames and includes 2,048 entries.
The PID is transformed into a table index using the pid_hashfn macro, which expands to:
#define pid_hashfn(x) hash_long((unsigned long) x, pidhash_shift)
The pidhash_shift variable
stores the length in bits of a table index (11, in our example). The
hash_long( ) function is used by
many hash functions; on a 32-bit architecture it is essentially
equivalent to:
unsigned long hash_long(unsigned long val, unsigned int bits)
{
unsigned long hash = val * 0x9e370001UL;
return hash >> (32 - bits);
}
Because in our example pidhash_shift is equal to 11, pid_hashfn yields values ranging between 0
and 2^11 − 1 = 2047.
As every basic computer science course explains, a hash function does not always ensure a one-to-one correspondence between PIDs and table indexes. Two different PIDs that hash into the same table index are said to be colliding.
Linux uses chaining to handle colliding PIDs; each table entry is the head of a doubly linked list of colliding process descriptors. Figure 3-5 illustrates a PID hash table with two lists. The processes having PIDs 2,890 and 29,384 hash into the 200th element of the table, while the process having PID 29,385 hashes into the 1,466th element of the table.
Hashing with chaining is preferable to a linear transformation from PIDs to table indexes because at any given instance, the number of processes in the system is usually far below 32,768 (the maximum number of allowed PIDs). It would be a waste of storage to define a table consisting of 32,768 entries, if, at any given instance, most such entries are unused.
The data structures used in the PID hash tables are quite
sophisticated, because they must keep track of the relationships
between the processes. As an example, suppose that the kernel must
retrieve all processes belonging to a given thread group, that is,
all processes whose tgid field is
equal to a given number. Looking in the hash table for the given
thread group number returns just one process descriptor, that is,
the descriptor of the thread group leader. To quickly retrieve the
other processes in the group, the kernel must maintain a list of
processes for each thread group. The same situation arises when
looking for the processes belonging to a given login session or
belonging to a given process group.
The PID hash tables' data structures solve all these problems,
because they allow the definition of a list of processes for any PID
number included in a hash table. The core data structure is an array
of four pid structures embedded
in the pids field of the process
descriptor; the fields of the pid
structure are shown in Table 3-6.
Table 3-6. The fields of the pid data structures
| Type | Name | Description |
|---|---|---|
| int | nr | The PID number |
| struct hlist_node | pid_chain | The links to the next and previous elements in the hash chain list |
| struct list_head | pid_list | The head of the per-PID list |
Figure 3-6
shows an example based on the PIDTYPE_TGID hash table. The second entry
of the pid_hash array stores the
address of the hash table, that is, the array of hlist_head structures representing the
heads of the chain lists. In the chain list rooted at the
71st entry of the hash table, there are
two process descriptors corresponding to the PID numbers 246 and
4,351 (double-arrow lines represent a couple of forward and backward
pointers). The PID numbers are stored in the nr field of the pid structure embedded in the process
descriptor (by the way, because the thread group number coincides
with the PID of its leader, these numbers also are stored in the
pid field of the process
descriptors). Let us consider the per-PID list of the thread group
4,351: the head of the list is stored in the pid_list field of the process descriptor
included in the hash table, while the links to the next and previous
elements of the per-PID list also are stored in the pid_list field of each list
element.
The following functions and macros are used to handle the PID hash tables:
do_each_task_pid(nr, type, task)
while_each_task_pid(nr, type, task)
Mark begin and end of a do-while loop that iterates over the per-PID list associated with the PID number nr of type type; in any iteration, task points to the process descriptor of the currently scanned element.
find_task_by_pid_type(type, nr)
Looks for the process having PID nr in the hash table of type type. The function returns a process descriptor pointer if a match is found, otherwise it returns NULL.
find_task_by_pid(nr)
Same as find_task_by_pid_type(PIDTYPE_PID, nr).
attach_pid(task, type, nr)
Inserts the process descriptor pointed to by task in the PID hash table of type type according to the PID number nr; if a process descriptor having PID nr is already in the hash table, the function simply inserts task in the per-PID list of the already present process.
detach_pid(task, type)
Removes the process descriptor pointed to by task from the per-PID list of type type to which the descriptor belongs. If the per-PID list does not become empty, the function terminates. Otherwise, the function removes the process descriptor from the hash table of type type; finally, if the PID number does not occur in any other hash table, the function clears the corresponding bit in the PID bitmap, so that the number can be recycled.
next_thread(task)
Returns the process descriptor address of the lightweight process that follows task in the hash table list of type PIDTYPE_TGID. Because the hash table list is circular, when applied to a conventional process the macro returns the descriptor address of the process itself.
The runqueue lists group all processes in a TASK_RUNNING state. When it comes to
grouping processes in other states, the various states call for
different types of treatment, with Linux opting for one of the choices
shown in the following list.
Processes in a TASK_STOPPED, EXIT_ZOMBIE, or EXIT_DEAD state are not linked in
specific lists. There is no need to group processes in any of
these three states, because stopped, zombie, and dead processes
are accessed only via PID or via linked lists of the child
processes for a particular parent.
Processes in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state are
subdivided into many classes, each of which corresponds to a
specific event. In this case, the process state does not provide
enough information to retrieve the process quickly, so it is
necessary to introduce additional lists of processes. These are
called wait queues and are discussed next.
Wait queues have several uses in the kernel, particularly for interrupt handling, process synchronization, and timing. Because these topics are discussed in later chapters, we'll just say here that a process must often wait for some event to occur, such as for a disk operation to terminate, a system resource to be released, or a fixed interval of time to elapse. Wait queues implement conditional waits on events: a process wishing to wait for a specific event places itself in the proper wait queue and relinquishes control. Therefore, a wait queue represents a set of sleeping processes, which are woken up by the kernel when some condition becomes true.
Wait queues are implemented as doubly linked lists whose
elements include pointers to process descriptors. Each wait queue is
identified by a wait queue head, a data
structure of type wait_queue_head_t:
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
Because wait queues are modified by interrupt handlers as well
as by major kernel functions, the doubly linked lists must be
protected from concurrent accesses, which could induce unpredictable
results (see Chapter 5).
Synchronization is achieved by the lock spin lock in the wait queue head. The
task_list field is the head of
the list of waiting processes.
Elements of a wait queue list are of type wait_queue_t:
struct __wait_queue {
    unsigned int flags;
    struct task_struct *task;
    wait_queue_func_t func;
    struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
Each element in the wait queue list represents a
sleeping process, which is waiting for some event to occur;
its descriptor address is stored in the task field. The task_list field contains the pointers that
link this element to the list of processes waiting for the same
event.
However, it is not always convenient to wake up all sleeping processes in a wait queue. For instance, if two or more processes are waiting for exclusive access to some resource to be released, it makes sense to wake up just one process in the wait queue. This process takes the resource, while the other processes continue to sleep. (This avoids a problem known as the "thundering herd," with which multiple processes are wakened only to race for a resource that can be accessed by one of them, with the result that remaining processes must once more be put back to sleep.)
Thus, there are two kinds of sleeping processes:
exclusive processes (denoted by the value 1 in the flags field of the corresponding wait
queue element) are selectively woken up by the kernel, while
nonexclusive processes (denoted by the value 0 in the flags field) are always woken up by the
kernel when the event occurs. A process waiting for a resource that
can be granted to just one process at a time is a typical exclusive
process. Processes waiting for an event that may concern any of them
are nonexclusive. Consider, for instance, a group of processes that
are waiting for the termination of a group of disk block transfers:
as soon as the transfers complete, all waiting processes must be
woken up. As we'll see next, the func field of a wait queue element is used
to specify how the processes sleeping in the wait queue should be
woken up.
A new wait queue head may be defined by using the
DECLARE_WAIT_QUEUE_HEAD(name)
macro, which statically declares a new wait queue head variable
called name and initializes its
lock and task_list fields. The init_waitqueue_head( ) function may be
used to initialize a wait queue head variable that was allocated
dynamically.
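For illustration, the two initialization styles might be used as follows; this is a sketch of kernel code, not a standalone program, and the `disk_wait` name and `my_device` structure are invented for the example:

```c
#include <linux/wait.h>

/* Static definition: declares the head and initializes its
 * lock and task_list fields in one step. */
static DECLARE_WAIT_QUEUE_HEAD(disk_wait);

/* Dynamic initialization: for a wait queue head embedded in a
 * structure allocated at runtime (my_device is hypothetical). */
struct my_device {
    wait_queue_head_t wq;
    /* ... */
};

static void my_device_init(struct my_device *dev)
{
    init_waitqueue_head(&dev->wq);
}
```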
The init_waitqueue_entry(q,p
) function initializes a wait_queue_t structure q as follows:
q->flags = 0;
q->task = p;
q->func = default_wake_function;
The nonexclusive process p
will be awakened by default_wake_function(
), which is a simple wrapper for the try_to_wake_up( ) function discussed in
Chapter 7.
Alternatively, the DEFINE_WAIT macro declares a new wait_queue_t variable and initializes it
with the descriptor of the process currently executing on the CPU
and the address of the autoremove_wake_function( ) wake-up
function. This function invokes default_wake_function( ) to awaken the
sleeping process, and then removes the wait queue element from the
wait queue list. Finally, a kernel developer can define a custom
awakening function by initializing the wait queue element with the
init_waitqueue_func_entry( )
function.
Once an element is defined, it must be inserted into a wait
queue. The add_wait_queue( )
function inserts a nonexclusive process in the first position of a
wait queue list. The add_wait_queue_exclusive( ) function
inserts an exclusive process in the last position of a wait queue
list. The remove_wait_queue( )
function removes a process from a wait queue list. The waitqueue_active( ) function checks
whether a given wait queue list is empty.
A process wishing to wait for a specific condition can invoke any of the functions shown in the following list.
The sleep_on( )
function operates on the current process:
void sleep_on(wait_queue_head_t *wq)
{
wait_queue_t wait;
init_waitqueue_entry(&wait, current);
current->state = TASK_UNINTERRUPTIBLE;
add_wait_queue(wq,&wait); /* wq points to the wait queue head */
schedule( );
remove_wait_queue(wq, &wait);
}
The function sets the state of the current process to
TASK_UNINTERRUPTIBLE and
inserts it into the specified wait queue. Then it invokes the
scheduler, which resumes the execution of another process. When
the sleeping process is awakened, the scheduler resumes
execution of the sleep_on( )
function, which removes the process from the wait queue.
The interruptible_sleep_on(
) function is identical to sleep_on( ), except that it sets the
state of the current process to TASK_INTERRUPTIBLE instead of setting
it to TASK_UNINTERRUPTIBLE,
so that the process also can be woken up by receiving a
signal.
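Based on that description, interruptible_sleep_on( ) can be sketched as the same routine with a single line changed; this is an illustrative reconstruction, not the actual kernel source:

```c
void interruptible_sleep_on(wait_queue_head_t *wq)
{
    wait_queue_t wait;
    init_waitqueue_entry(&wait, current);
    current->state = TASK_INTERRUPTIBLE; /* the only difference from sleep_on( ) */
    add_wait_queue(wq, &wait);
    schedule( );
    remove_wait_queue(wq, &wait);
}
```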
The sleep_on_timeout( )
and interruptible_sleep_on_timeout(
) functions are similar to the previous ones, but they
also allow the caller to define a time interval after which the
process will be woken up by the kernel. To do this, they invoke
the schedule_timeout( )
function instead of schedule(
) (see the section "An Application of Dynamic
Timers: the nanosleep( ) System Call" in Chapter 6).
The prepare_to_wait( ),
prepare_to_wait_exclusive( ),
and finish_wait( ) functions,
introduced in Linux 2.6, offer yet another way to put the
current process to sleep in a wait queue. Typically, they are
used as follows:
DEFINE_WAIT(wait);
prepare_to_wait_exclusive(&wq, &wait, TASK_INTERRUPTIBLE);
/* wq is the head of the wait queue */
...
if (!condition)
schedule();
finish_wait(&wq, &wait);
The prepare_to_wait( )
and prepare_to_wait_exclusive(
) functions set the process state to the value passed
as the third parameter, then set the exclusive flag in the wait
queue element respectively to 0 (nonexclusive) or 1 (exclusive),
and finally insert the wait queue element wait into the list of the wait queue
head wq.
As soon as the process is awakened, it executes the
finish_wait( ) function,
which sets the process state back to TASK_RUNNING (just in case the
wake-up condition became true before invoking schedule( )), and removes the wait
queue element from the wait queue list (unless this has already
been done by the wake-up function).
The wait_event and
wait_event_interruptible
macros put the calling process to sleep on a wait queue until a
given condition is verified. For instance, the wait_event(wq,condition) macro
essentially yields the following fragment:
DEFINE_WAIT(__wait);
for (;;) {
    prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);
    if (condition)
        break;
    schedule( );
}
finish_wait(&wq, &__wait);
A few comments on the functions mentioned in the above list:
the sleep_on( )-like functions
cannot be used in the common situation where one has to test a
condition and atomically put the process to sleep when the condition
is not verified; therefore, because they are a well-known source of
race conditions, their use is discouraged. Moreover, in order to
insert an exclusive process into a wait queue, the kernel must make
use of the prepare_to_wait_exclusive(
) function (or just invoke add_wait_queue_exclusive( ) directly); any
other helper function inserts the process as nonexclusive. Finally,
unless DEFINE_WAIT or finish_wait( ) are used, the kernel must
remove the wait queue element from the list after the waiting
process has been awakened.
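The race mentioned above can be made concrete with a short sketch (hypothetical driver code; `data_ready` and `wq` are invented names): the condition test and the call to sleep_on( ) are two separate steps, so a wake-up can be lost in between.

```c
/* Racy pattern (do not use): */
if (!data_ready)         /* (1) condition tested: false                 */
    /* ... the event may occur right here; wake_up( ) finds the
     * wait queue empty, so the notification is lost ...               */
    sleep_on(&wq);       /* (2) process may now sleep forever           */

/* Safe pattern: prepare_to_wait( ) inserts the process into the queue
 * and marks it sleeping *before* the condition is tested, so a wake-up
 * delivered after the test simply makes schedule( ) return at once. */
DEFINE_WAIT(wait);
prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
if (!data_ready)
    schedule( );
finish_wait(&wq, &wait);
```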
The kernel awakens processes in the wait queues, putting them
in the TASK_RUNNING state, by
means of one of the following macros: wake_up, wake_up_nr, wake_up_all, wake_up_interruptible, wake_up_interruptible_nr, wake_up_interruptible_all, wake_up_interruptible_sync, and wake_up_locked. One can understand what
each of these macros does from its name:
All macros take into consideration sleeping processes in
the TASK_INTERRUPTIBLE state;
if the macro name does not include the string "interruptible,"
sleeping processes in the TASK_UNINTERRUPTIBLE state also are
considered.
All macros wake all nonexclusive processes having the required state (see the previous bullet item).
The macros whose names include the string "nr" wake a given number of exclusive processes having the required state; this number is a parameter of the macro. The macros whose names include the string "all" wake all exclusive processes having the required state. Finally, the macros whose names don't include "nr" or "all" wake exactly one exclusive process that has the required state.
The macros whose names don't include the string "sync"
check whether the priority of any of the woken processes is
higher than that of the processes currently running in the
systems and invoke schedule(
) if necessary. These checks are not made by the macro
whose name includes the string "sync"; as a result, execution of
a high priority process might be slightly delayed.
The wake_up_locked
macro is similar to wake_up,
except that it is called when the spin lock in wait_queue_head_t is already
held.
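Typical call sites look as follows (a sketch; `wq` is an assumed wait queue head):

```c
wake_up(&wq);                /* all nonexclusive sleepers + one exclusive     */
wake_up_nr(&wq, 4);          /* all nonexclusive sleepers + up to 4 exclusive */
wake_up_all(&wq);            /* every sleeper, exclusive or not               */
wake_up_interruptible(&wq);  /* as wake_up, but only TASK_INTERRUPTIBLE ones  */
```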
For instance, the wake_up
macro is essentially equivalent to the following code
fragment:
void wake_up(wait_queue_head_t *q)
{
struct list_head *tmp;
wait_queue_t *curr;
list_for_each(tmp, &q->task_list) {
curr = list_entry(tmp, wait_queue_t, task_list);
if (curr->func(curr, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE,
0, NULL) && curr->flags)
break;
}
}
The list_for_each macro
scans all items in the q->task_list doubly linked list, that
is, all processes in the wait queue. For each item, the list_entry macro computes the address of
the corresponding wait_queue_t
variable. The func field of this
variable stores the address of the wake-up function, which tries to
wake up the process identified by the task field of the wait queue element. If a
process has been effectively awakened (the function returned 1) and
if the process is exclusive (curr->flags equal to 1), the loop
terminates. Because all nonexclusive processes are always at the
beginning of the doubly linked list and all exclusive processes are
at the end, the function always wakes the nonexclusive processes and
then wakes one exclusive process, if any exists.[*]
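The termination rule of this loop can be modeled in plain user-space C. This is a toy model, not kernel code: `fake_wait_elem` and `simulate_wake_up` are invented names, and the wake-up function is assumed to always succeed.

```c
#include <stddef.h>

struct fake_wait_elem {
    unsigned int flags;   /* 1 = exclusive, 0 = nonexclusive */
    int woken;
};

/* Scans the "queue" front to back, mimicking the list_for_each loop:
 * every element is woken, and the scan stops right after the first
 * exclusive element has been woken.  Returns the number woken. */
int simulate_wake_up(struct fake_wait_elem *q, int n)
{
    int woken = 0;
    for (int i = 0; i < n; i++) {
        q[i].woken = 1;   /* stands for curr->func(...) returning 1 */
        woken++;
        if (q[i].flags)   /* exclusive process woken: terminate */
            break;
    }
    return woken;
}
```

With three nonexclusive elements at the head and two exclusive ones at the tail, the scan wakes four elements and leaves the last exclusive one asleep, matching the ordering behavior described above.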
Each process has an associated set of resource limits , which specify the amount of system resources it can use. These limits keep a user from overwhelming the system (its CPU, disk space, and so on). Linux recognizes the following resource limits illustrated in Table 3-7.
The resource limits for the current process are stored in the
current->signal->rlim field,
that is, in a field of the process's signal descriptor (see the
section "Data Structures
Associated with Signals" in Chapter 11). The field is an
array of elements of type struct
rlimit, one for each resource limit:
struct rlimit {
    unsigned long rlim_cur;
    unsigned long rlim_max;
};
Table 3-7. Resource limits
| Field name | Description |
|---|---|
| RLIMIT_AS | The maximum size of process address space, in bytes. The kernel checks this value when the process uses malloc( ) or related functions to enlarge its address space (see the section "The Process's Address Space" in Chapter 9). |
| RLIMIT_CORE | The maximum core dump file size, in bytes. The kernel checks this value when a process is aborted, before creating a core file in the current directory of the process (see the section "Actions Performed upon Delivering a Signal" in Chapter 11). If the limit is 0, the kernel won't create the file. |
| RLIMIT_CPU | The maximum CPU time for the process, in seconds. If the process exceeds the limit, the kernel sends it a SIGXCPU signal, and then, if the process doesn't terminate, a SIGKILL signal (see Chapter 11). |
| RLIMIT_DATA | The maximum heap size, in bytes. The kernel checks this value before expanding the heap of the process (see the section "Managing the Heap" in Chapter 9). |
| RLIMIT_FSIZE | The maximum file size allowed, in bytes. If the process tries to enlarge a file to a size greater than this value, the kernel sends it a SIGXFSZ signal. |
| RLIMIT_LOCKS | Maximum number of file locks (currently, not enforced). |
| RLIMIT_MEMLOCK | The maximum size of nonswappable memory, in bytes. The kernel checks this value when the process tries to lock a page frame in memory using the mlock( ) and mlockall( ) system calls. |
| RLIMIT_MSGQUEUE | Maximum number of bytes in POSIX message queues (see the section "POSIX Message Queues" in Chapter 19). |
| RLIMIT_NOFILE | The maximum number of open file descriptors. The kernel checks this value when opening a new file or duplicating a file descriptor (see Chapter 12). |
| RLIMIT_NPROC | The maximum number of processes that the user can own (see the section "The clone( ), fork( ), and vfork( ) System Calls" later in this chapter). |
| RLIMIT_RSS | The maximum number of page frames owned by the process (currently, not enforced). |
| RLIMIT_SIGPENDING | The maximum number of pending signals for the process (see Chapter 11). |
| RLIMIT_STACK | The maximum stack size, in bytes. The kernel checks this value before expanding the User Mode stack of the process (see the section "Page Fault Exception Handler" in Chapter 9). |
The rlim_cur field is the
current resource limit for the resource. For example, current->signal->rlim[RLIMIT_CPU].rlim_cur
represents the current limit on the CPU time of the running
process.
The rlim_max field is the
maximum allowed value for the resource limit. By using the getrlimit( ) and setrlimit( ) system calls, a user can always
increase the rlim_cur limit of some
resource up to rlim_max. However,
only the superuser (or, more precisely, a user who has the CAP_SYS_RESOURCE capability) can increase
the rlim_max field or set the
rlim_cur field to a value greater
than the corresponding rlim_max
field.
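From user space, these rules can be exercised directly with the getrlimit( ) and setrlimit( ) system calls. The helper below (`set_soft_limit` is an invented name for this sketch) adjusts the soft limit of a resource, which any process may do as long as it stays at or below the hard limit:

```c
#include <sys/resource.h>

/* Sets rlim_cur for the given resource, clamping the request to
 * rlim_max: staying at or below the hard limit never requires the
 * CAP_SYS_RESOURCE capability.  Returns 0 on success, -1 on failure. */
int set_soft_limit(int resource, rlim_t requested)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (requested > rl.rlim_max)   /* cannot exceed the hard limit */
        requested = rl.rlim_max;
    rl.rlim_cur = requested;
    return setrlimit(resource, &rl);
}
```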
Most resource limits contain the value RLIM_INFINITY (0xffffffff), which means that no user limit
is imposed on the corresponding resource (of course, real limits exist
due to kernel design restrictions, available RAM, available space on
disk, etc.). However, the system administrator may choose to impose
stronger limits on some resources. Whenever a user logs into the
system, the kernel creates a process owned by the superuser, which can
invoke setrlimit( ) to decrease the
rlim_max and rlim_cur fields for a resource. The same
process later executes a login shell and becomes owned by the user.
Each new process created by the user inherits the content of the
rlim array from its parent, and
therefore the user cannot override the limits enforced by the
administrator.
[*] The kernel also defines the task_t data type to be equivalent to
struct task_struct.
[*] There are other wait(
) -like library functions, such as wait3( ) and wait(
), but in Linux they are implemented by means of
the wait4( ) and waitpid( ) system calls.
[†] As already noted in the section "Segmentation in Linux" in Chapter 2, although technically these 32 bits are only the offset component of a logical address, they coincide with the linear address.
To control the execution of processes, the kernel must be able to suspend the execution of the process running on the CPU and resume the execution of some other process previously suspended. This activity goes variously by the names process switch, task switch, or context switch. The next sections describe the elements of process switching in Linux.
While each process can have its own address space, all processes have to share the CPU registers. So before resuming the execution of a process, the kernel must ensure that each such register is loaded with the value it had when the process was suspended.
The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called the hardware context . The hardware context is a subset of the process execution context, which includes all information needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.
In the description that follows, we will assume the prev local variable refers to the process
descriptor of the process being switched out and next refers to the one being switched in to
replace it. We can thus define a process switch
as the activity consisting of saving the hardware context of prev and replacing it with the hardware
context of next. Because process
switches occur quite often, it is important to minimize the time
spent in saving and loading hardware contexts.
Old versions of Linux took advantage of the hardware support
offered by the 80×86 architecture and performed a process switch
through a far jmp instruction[*] to the selector of the Task State Segment Descriptor of
the next process. While executing
the instruction, the CPU performs a hardware context
switch by automatically saving the old hardware context and
loading a new one. But Linux 2.6 uses software to perform a process
switch for the following reasons:
Step-by-step switching performed through a sequence of
mov instructions allows better
control over the validity of the data being loaded. In particular,
it is possible to check the values of the ds and es segmentation registers, which might
have been forged by a malicious user. This type of checking is not
possible when using a single far
jmp instruction.
The amount of time required by the old approach and the new approach is about the same. However, it is not possible to optimize a hardware context switch, while there might be room for improving the current switching code.
Process switching occurs only in Kernel Mode. The contents of
all registers used by a process in User Mode have already been saved
on the Kernel Mode stack before performing process switching (see
Chapter 4). This includes
the contents of the ss and esp pair that specifies the User Mode stack
pointer address.
The 80×86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:
When an 80×86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS (see the sections "Hardware Handling of Interrupts and Exceptions" in Chapter 4 and "Issuing a System Call via the sysenter Instruction" in Chapter 10).
When a User Mode process attempts to access an I/O port by
means of an in or out instruction, the CPU may need to access an I/O
Permission Bitmap stored in the TSS to verify whether the process
is allowed to address the port.
More precisely, when a process executes an in or out I/O instruction in User Mode, the
control unit performs the following operations:
It checks the 2-bit IOPL field in the eflags register. If it is set to 3, the control unit
executes the I/O instructions. Otherwise, it performs the next
check.
It accesses the tr
register to determine the current TSS, and thus
the proper I/O Permission Bitmap.
It checks the bit of the I/O Permission Bitmap corresponding to the I/O port specified in the I/O instruction. If it is cleared, the instruction is executed; otherwise, the control unit raises a "General protection " exception.
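The three checks above can be modeled by a small user-space function. This is a model only: `io_allowed` is an invented name, and on real hardware step 2 is performed by the CPU's control unit, which locates the bitmap through the tr register rather than taking it as a parameter.

```c
/* Returns 1 if the port access would be allowed, 0 if a "General
 * protection" exception would be raised. */
int io_allowed(unsigned int iopl, const unsigned char *io_bitmap,
               unsigned int port)
{
    if (iopl == 3)                 /* step 1: the IOPL field grants access */
        return 1;
    /* step 2 would locate the I/O Permission Bitmap through the tr
     * register; here the bitmap is simply passed in. */
    /* step 3: a cleared bit means the port may be accessed. */
    return !(io_bitmap[port >> 3] & (1u << (port & 7)));
}
```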
The tss_struct structure
describes the format of the TSS. As already mentioned in Chapter 2, the init_tss array stores one TSS for each CPU
on the system. At each process switch, the kernel updates some fields
of the TSS so that the corresponding CPU's control unit may safely
retrieve the information it needs. Thus, the TSS reflects the
privilege of the current process on the CPU, but there is no need to
maintain TSSs for processes when they're not running.
Each TSS has its own 8-byte Task State Segment
Descriptor (TSSD). This descriptor includes a 32-bit
Base field that points to the TSS
starting address and a 20-bit Limit
field. The S flag of a TSSD is
cleared to denote the fact that the corresponding TSS is a System
Segment (see the section "Segment Descriptors" in
Chapter 2).
The Type field is set to
either 9 or 11 to denote that the segment is actually a TSS. In the
Intel's original design, each process in the system should refer to
its own TSS; the second least significant bit of the Type field is called the Busy
bit; it is set to 1 if the process is being executed by a
CPU, and to 0 otherwise. In the Linux design, there is just one TSS for
each CPU, so the Busy bit is always set to 1.
The TSSDs created by Linux are stored in the Global Descriptor
Table (GDT), whose base address is stored in the gdtr register of each CPU. The tr register of each CPU contains the TSSD
Selector of the corresponding TSS. The register also includes two
hidden, nonprogrammable fields: the Base and Limit fields of the TSSD. In this way, the
processor can address the TSS directly without having to retrieve the
TSS address from the GDT.
At every process switch, the hardware context of the process being replaced must be saved somewhere. It cannot be saved on the TSS, as in the original Intel design, because Linux uses a single TSS for each processor, instead of one for every process.
Thus, each process descriptor includes a field called thread of type thread_struct, in which the kernel saves
the hardware context whenever the process is being switched out. As
we'll see later, this data structure includes fields for most of the
CPU registers, except the general-purpose registers such as eax, ebx, etc., which are stored in the Kernel
Mode stack.
A process switch may occur at just one well-defined point: the
schedule( ) function, which is
discussed at length in Chapter
7. Here, we are only concerned with how the kernel performs a
process switch.
Essentially, every process switch consists of two steps:
Switching the Page Global Directory to install a new address space; we'll describe this step in Chapter 9.
Switching the Kernel Mode stack and the hardware context, which provides all the information needed by the kernel to execute the new process, including the CPU registers.
Again, we assume that prev
points to the descriptor of the process being replaced, and next to the descriptor of the process being
activated. As we'll see in Chapter
7, prev and next are local variables of the schedule( ) function.
The second step of the process switch is performed by
the switch_to macro. It is one of
the most hardware-dependent routines of the kernel, and it takes
some effort to understand what it does.
First of all, the macro has three parameters, called prev, next, and last. You might easily guess the role of
prev and next: they are just placeholders for the
local variables prev and next, that is, they are input parameters
that specify the memory locations containing the descriptor address
of the process being replaced and the descriptor address of the new
process, respectively.
What about the third parameter, last? Well, in any process switch three
processes are involved, not just two. Suppose the kernel decides to
switch off process A and to activate process B. In the schedule( ) function, prev points to A's descriptor and next points to B's descriptor. As soon as
the switch_to macro deactivates
A, the execution flow of A freezes.
Later, when the kernel wants to reactivate A, it must switch
off another process C (in general, this is different from B) by
executing another switch_to macro
with prev pointing to C and
next pointing to A. When A
resumes its execution flow, it finds its old Kernel Mode stack, so
the prev local variable points to
A's descriptor and next points to
B's descriptor. The scheduler, which is now executing on behalf of
process A, has lost any reference to C. This reference, however,
turns out to be useful to complete the process switching (see Chapter 7 for more
details).
The last parameter of the switch_to macro is an output parameter
that specifies a memory location in which the macro writes the
descriptor address of process C (of course, this is done after A
resumes its execution). Before the process switching, the macro
saves in the eax CPU register the
content of the variable identified by the first input parameter
prev—that is, the prev local variable allocated on the
Kernel Mode stack of A. After the process switching, when A has
resumed its execution, the macro writes the content of the eax CPU register in the memory location of
A identified by the third output parameter last. Because the CPU register doesn't
change across the process switch, this memory location receives the
address of C's descriptor. In the current implementation of schedule( ), the last parameter identifies
the prev local variable of A, so
prev is overwritten with the
address of C.
The contents of the Kernel Mode stacks of processes A, B, and
C are shown in Figure
3-7, together with the values of the eax register; be warned that the figure
shows the value of the prev local
variable before its value is overwritten with
the contents of the eax
register.
The switch_to macro is
coded in extended inline assembly language
that makes for rather complex reading: in fact, the
code refers to registers by means of a special positional notation
that allows the compiler to freely choose the general-purpose
registers to be used. Rather than follow the cumbersome extended
inline assembly language, we'll describe what the switch_to macro typically does on an 80×86
microprocessor by using standard assembly language:
Saves the values of prev and next in the eax and edx registers, respectively:
movl prev, %eax
movl next, %edx
Saves the contents of the eflags and ebp
registers in the prev Kernel
Mode stack. They must be saved because the compiler assumes that
they will stay unchanged until the end of switch_to:
pushfl
pushl %ebp
Saves the content of esp in prev->thread.esp so that the field
points to the top of the prev
Kernel Mode stack:
movl %esp,484(%eax)
The 484(%eax) operand
identifies the memory cell whose address is the contents of
eax plus 484.
Loads next->thread.esp in esp. From now on, the kernel operates
on the Kernel Mode stack of next, so this instruction performs the
actual process switch from prev to next. Because the address of a process
descriptor is closely related to that of the Kernel Mode stack
(as explained in the section "Identifying a
Process" earlier in this chapter), changing the kernel
stack means changing the current process:
movl 484(%edx), %esp
Saves the address labeled 1 (shown later in this section) in
prev->thread.eip. When the
process being replaced resumes its execution, the process
executes the instruction labeled as 1:
movl $1f, 480(%eax)
On the Kernel Mode stack of next, the macro pushes the next->thread.eip value, which, in
most cases, is the address labeled as 1:
pushl 480(%edx)
Jumps to the _ _switch_to(
) C function (see next):
jmp _ _switch_to
Here process A that was replaced by B gets the CPU again:
it executes a few instructions that restore the contents of the
eflags and ebp registers. The first of these two
instructions is labeled as 1:
1:
popl %ebp
popfl
Notice how these pop
instructions refer to the kernel stack of the prev process. They will be executed
when the scheduler selects prev as the new process to be executed
on the CPU, thus invoking switch_to with prev as the second parameter.
Therefore, the esp register
points to the prev's Kernel
Mode stack.
Copies the content of the eax register (loaded in step 1 above)
into the memory location identified by the third parameter
last of the switch_to macro:
movl %eax, last
As discussed earlier, the eax register points to the descriptor
of the process that has just been replaced.[*]
The _ _switch_to( )
function does the bulk of the process switch started by the switch_to( ) macro. It acts on the
prev_p and next_p parameters that denote the former
process and the new process. This function call is different from
the average function call, though, because _ _switch_to( ) takes the prev_p and next_p parameters from the eax and edx registers (where we saw they were
stored), not from the stack like most functions. To force the
function to go to the registers for its parameters, the kernel uses
the _ _attribute_ _ and regparm keywords, which are nonstandard
extensions of the C language implemented by the gcc compiler. The _ _switch_to( ) function is declared in
the include/asm-i386/system.h
header file as follows:
__switch_to(struct task_struct *prev_p,
            struct task_struct *next_p)
   __attribute__((regparm(3)));
The steps performed by the function are the following:
Executes the code yielded by the _ _unlazy_fpu( ) macro (see the
section "Saving and
Loading the FPU , MMX, and XMM Registers" later in this
chapter) to optionally save the contents of the FPU, MMX, and
XMM registers of the prev_p
process.
_ _unlazy_fpu(prev_p);
Executes the smp_processor_id(
) macro to get the index of the local
CPU , namely the CPU that executes the code. The
macro gets the index from the cpu field of the thread_info structure of the current
process and stores it into the cpu local variable.
Loads next_p->thread.esp0 in the esp0 field of the TSS relative to the
local CPU; as we'll see in the section "Issuing a System Call via
the sysenter Instruction " in Chapter
10, any future privilege level change from User Mode to
Kernel Mode raised by a sysenter assembly instruction will
copy this address in the esp
register:
init_tss[cpu].esp0 = next_p->thread.esp0;
Loads in the Global Descriptor Table of the local CPU the
Thread-Local Storage (TLS) segments used by the next_p process; the three Segment
Selectors are stored in the tls_array array inside the process
descriptor (see the section "Segmentation in
Linux" in Chapter
2).
cpu_gdt_table[cpu][6] = next_p->thread.tls_array[0];
cpu_gdt_table[cpu][7] = next_p->thread.tls_array[1];
cpu_gdt_table[cpu][8] = next_p->thread.tls_array[2];
Stores the contents of the fs and gs segmentation registers in prev_p->thread.fs and prev_p->thread.gs, respectively;
the corresponding assembly language instructions are:
movl %fs, 40(%esi)
movl %gs, 44(%esi)
The esi register points
to the prev_p->thread
structure.
If the fs or the
gs segmentation register have
been used either by the prev_p or by the next_p process (i.e., if they have a
nonzero value), loads into these registers the values stored in
the thread_struct descriptor
of the next_p process. This
step logically complements the actions performed in the previous
step. The main assembly language instructions are:
movl 40(%ebx),%fs
movl 44(%ebx),%gs
The ebx register points
to the next_p->thread
structure. The code is actually more intricate, as an exception
might be raised by the CPU when it detects an invalid segment
register value. The code takes this possibility into account by
adopting a "fix-up" approach (see the section "Dynamic Address Checking:
The Fix-up Code" in Chapter 10).
Loads six of the dr0,..., dr7 debug registers [*] with the contents of the next_p->thread.debugreg array. This
is done only if next_p was
using the debug registers when it was suspended (that is, field
next_p->thread.debugreg[7]
is not 0). These registers need not be saved, because the
prev_p->thread.debugreg
array is modified only when a debugger wants to monitor prev:
if (next_p->thread.debugreg[7]){
loaddebug(&next_p->thread, 0);
loaddebug(&next_p->thread, 1);
loaddebug(&next_p->thread, 2);
loaddebug(&next_p->thread, 3);
/* no 4 and 5 */
loaddebug(&next_p->thread, 6);
loaddebug(&next_p->thread, 7);
}
Updates the I/O bitmap in the TSS, if necessary. This must
be done when either next_p or
prev_p has its own customized
I/O Permission Bitmap:
if (prev_p->thread.io_bitmap_ptr || next_p->thread.io_bitmap_ptr)
handle_io_bitmap(&next_p->thread, &init_tss[cpu]);
Because processes seldom modify the I/O Permission Bitmap,
this bitmap is handled in a "lazy" mode: the actual bitmap is
copied into the TSS of the local CPU only if a process actually
accesses an I/O port in the current timeslice. The customized
I/O Permission Bitmap of a process is stored in a buffer pointed
to by the io_bitmap_ptr field
of the thread_info structure.
The handle_io_bitmap( )
function sets up the io_bitmap field of the TSS used by the
local CPU for the next_p
process as follows:
If the next_p
process does not have its own customized I/O Permission
Bitmap, the io_bitmap
field of the TSS is set to the value 0x8000.
If the next_p
process has its own customized I/O Permission Bitmap, the
io_bitmap field of the
TSS is set to the value 0x9000.
The io_bitmap field of
the TSS should contain an offset inside the TSS where the actual
bitmap is stored. The 0x8000
and 0x9000 values point
outside of the TSS limit and will thus cause a "General
protection " exception whenever the User Mode process
attempts to access an I/O port (see the section "Exceptions" in
Chapter 4). The
do_general_protection( )
exception handler will check the value stored in the io_bitmap field: if it is 0x8000, the function sends a SIGSEGV signal to the User Mode
process; otherwise, if it is 0x9000, the function copies the
process bitmap (pointed to by the io_bitmap_ptr field in the thread_info structure) in the TSS of
the local CPU, sets the io_bitmap field to the actual bitmap
offset (104), and forces a new execution of the faulty assembly
language instruction.
Terminates. The _ _switch_to(
) C function ends by means of the statement:
return prev_p;
The corresponding assembly language instructions generated by the compiler are:
movl %edi,%eax
ret
The prev_p parameter
(now in edi) is copied into
eax, because by default the
return value of any C function is passed in the eax register. Notice that the value of
eax is thus preserved across
the invocation of _ _switch_to(
); this is quite important, because the invoking
switch_to macro assumes that
eax always stores the address
of the process descriptor being replaced.
The ret assembly
language instruction loads the eip program counter with the return
address stored on top of the stack. However, the _ _switch_to( ) function has been
invoked simply by jumping into it. Therefore, the ret instruction finds on the stack the
address of the instruction labeled as 1, which was pushed by the switch_to macro. If next_p was never suspended before
because it is being executed for the first time, the function
finds the starting address of the ret_from_fork( ) function (see the
section "The clone(
), fork( ), and vfork( ) System Calls" later in this
chapter).
Starting with the Intel 80486DX, the arithmetic
floating-point unit (FPU) has been integrated into the CPU. The name
mathematical coprocessor continues to be used in
memory of the days when floating-point computations were executed by
an expensive special-purpose chip. To maintain compatibility with
older models, however, floating-point arithmetic functions are
performed with ESCAPE instructions , which are instructions with a prefix byte ranging
between 0xd8 and 0xdf. These instructions act on the set of
floating-point registers included in the CPU. Clearly, if a process is
using ESCAPE instructions, the contents of the floating-point
registers belong to its hardware context and should be saved.
In later Pentium models, Intel introduced a new set of assembly language instructions into its microprocessors. They are called MMX instructions and are supposed to speed up the execution of multimedia applications. MMX instructions act on the floating-point registers of the FPU. The obvious disadvantage of this architectural choice is that programmers cannot mix floating-point instructions and MMX instructions. The advantage is that operating system designers can ignore the new instruction set, because the same facility of the task-switching code for saving the state of the floating-point unit can also be relied upon to save the MMX state.
MMX instructions speed up multimedia applications, because they introduce a single-instruction multiple-data (SIMD) pipeline inside the processor. The Pentium III model extends that SIMD capability: it introduces the SSE extensions (Streaming SIMD Extensions), which add facilities for handling floating-point values contained in eight 128-bit registers called the XMM registers. Such registers do not overlap with the FPU and MMX registers, so SSE and FPU/MMX instructions may be freely mixed. The Pentium 4 model introduces yet another feature: the SSE2 extensions, which is basically an extension of SSE supporting higher-precision floating-point values. SSE2 uses the same set of XMM registers as SSE.
The 80×86 microprocessors do not automatically save the FPU,
MMX, and XMM registers in the TSS. However, they include some hardware
support that enables kernels to save these registers only when needed.
The hardware support consists of a TS (Task-Switching) flag in the cr0 register, which obeys the following rules:
Every time a hardware context switch is performed, the
TS flag is set.
Every time an ESCAPE, MMX, SSE, or SSE2 instruction is
executed when the TS flag is
set, the control unit raises a "Device not available " exception (see Chapter 4).
The TS flag allows the kernel
to save and restore the FPU, MMX, and XMM registers only when really
needed. To illustrate how it works, suppose that a process A is using
the mathematical coprocessor. When a context switch occurs from A to
B, the kernel sets the TS flag and saves the floating-point registers
into the TSS of process A. If the new process B does not use the
mathematical coprocessor, the kernel won't need to restore the
contents of the floating-point registers. But as soon as B tries to
execute an ESCAPE or MMX instruction, the CPU raises a "Device not
available" exception, and the corresponding handler loads the
floating-point registers with the values saved in the TSS of process
B.
Let's now describe the data structures introduced to handle
selective loading of the FPU, MMX, and XMM registers. They are stored
in the thread.i387 subfield of the
process descriptor, whose format is described by the i387_union union:
union i387_union {
struct i387_fsave_struct fsave;
struct i387_fxsave_struct fxsave;
struct i387_soft_struct soft;
};
As you see, the field may store just one of three different
types of data structures. The i387_soft_struct type is used by CPU models
without a mathematical coprocessor; the Linux kernel still supports
these old chips by emulating the coprocessor via software. We don't
discuss this legacy case further, however. The i387_fsave_struct type is used by CPU models
with a mathematical coprocessor and, optionally, an MMX unit. Finally,
the i387_fxsave_struct type is used
by CPU models featuring SSE and SSE2 extensions.
The process descriptor includes two additional flags:
The TS_USEDFPU flag,
which is included in the status
field of the thread_info
descriptor. It specifies whether the process used the FPU, MMX, or
XMM registers in the current execution run.
The PF_USED_MATH flag,
which is included in the flags
field of the task_struct
descriptor. This flag specifies whether the contents of the
thread.i387 subfield are
significant. The flag is cleared (not significant) in two cases,
shown in the following list.
When the process starts executing a new program by
invoking an execve( )
system call (see Chapter
20). Because control will never return to the former
program, the data currently stored in thread.i387 is never used
again.
When a process that was executing a program in User Mode
starts executing a signal handler procedure (see Chapter 11). Because
signal handlers are asynchronous with respect to the program
execution flow, the floating-point registers could be
meaningless to the signal handler. However, the kernel saves
the floating-point registers in thread.i387 before starting the
handler and restores them after the handler terminates.
Therefore, a signal handler is allowed to use the mathematical
coprocessor.
As stated earlier, the _
_switch_to( ) function executes the _ _unlazy_fpu macro, passing the process
descriptor of the prev process
being replaced as an argument. The macro checks the value of the
TS_USEDFPU flags of prev. If the flag is set, prev has used an FPU, MMX, SSE, or SSE2
instruction; therefore, the kernel must save the relative hardware
context:
if (prev->thread_info->status & TS_USEDFPU)
    save_init_fpu(prev);
The save_init_fpu( )
函数依次执行以下操作:
The save_init_fpu( )
function, in turn, executes essentially the following
operations:
Dumps the contents of the FPU registers in the process
descriptor of prev and then
reinitializes the FPU. If the CPU uses SSE/SSE2 extensions, it
also dumps the contents of the XMM registers and reinitializes
the SSE/SSE2 unit. A couple of powerful extended inline assembly
language instructions take care of everything, either:
asm volatile( "fxsave %0 ; fnclex" : "=m" (prev->thread.i387.fxsave) );
if the CPU uses SSE/SSE2 extensions, or otherwise:
asm volatile( "fnsave %0 ; fwait" : "=m" (prev->thread.i387.fsave) );
Resets the TS_USEDFPU
flag of prev:
prev->thread_info->status &= ~TS_USEDFPU;
Sets the TS flag of cr0
by means of the stts(
) macro, which in practice yields assembly language
instructions like the following:
movl %cr0, %eax
orl $8,%eax
movl %eax, %cr0
The contents of the floating-point registers are not
restored right after the next
process resumes execution. However, the TS flag of cr0 has been set by _ _unlazy_fpu( ). Thus, the first time the
next process tries to execute an
ESCAPE, MMX, or SSE/SSE2 instruction, the control unit raises a
"Device not available" exception, and the kernel (more precisely,
the exception handler invoked by the exception) runs the math_state_restore( ) function. The
next process is identified by
this handler as current.
void math_state_restore( )
{
asm volatile ("clts"); /* clear the TS flag of cr0 */
if (!(current->flags & PF_USED_MATH))
init_fpu(current);
restore_fpu(current);
current->thread.status |= TS_USEDFPU;
}
The function clears the TS flag of cr0, so that further FPU, MMX, or SSE/SSE2
instructions executed by the process won't trigger the "Device not
available" exception. If the contents of the thread.i387 subfield are not significant,
i.e., if the PF_USED_MATH flag is
equal to 0, init_fpu() is invoked
to reset the thread.i387 subfield
and to set the PF_USED_MATH flag
of current to 1. The restore_fpu( ) function is then invoked to
load the FPU registers with the proper values stored in the thread.i387 subfield. To do this, either
the fxrstor or the frstor
assembly language instructions are used, depending on
whether the CPU supports SSE/SSE2 extensions. Finally, math_state_restore( ) sets the TS_USEDFPU flag.
Even the kernel can make use of the FPU, MMX, or SSE/SSE2 units. In doing so, of course, it should avoid interfering with any computation carried on by the current User Mode process. Therefore:
Before using the coprocessor, the kernel must invoke
kernel_fpu_begin( ), which
essentially calls save_init_fpu(
) to save the contents of the registers if the User
Mode process used the FPU (TS_USEDFPU flag), and then resets the
TS flag of the cr0 register.
After using the coprocessor, the kernel must invoke
kernel_fpu_end( ), which sets
the TS flag of the cr0 register.
Later, when the User Mode process executes a coprocessor
instruction, the math_state_restore(
) function will restore the contents of the registers,
just as in process switch handling.
It should be noted, however, that the execution time of
kernel_fpu_begin( ) is rather
large when the current User Mode process is using the coprocessor,
so much as to nullify the speedup obtained by using the FPU, MMX, or
SSE/SSE2 units. As a matter of fact, the kernel uses them only in a
few places, typically when moving or clearing large memory areas or
when computing checksum functions.
[*] far jmp instructions
modify both the cs and eip registers, while simple jmp instructions modify only eip.
[*] As stated earlier in this section, the current
implementation of the schedule(
) function reuses the prev local variable, so that the
assembly language instruction looks like movl %eax,prev.
[*] The 80×86 debug registers allow a process to be monitored by the hardware. Up to four breakpoint areas may be defined. Whenever a monitored process issues a linear address included in one of the breakpoint areas, an exception occurs.
Unix operating systems rely heavily on process creation to satisfy user requests. For example, the shell creates a new process that executes another copy of the shell whenever the user enters a command.
Traditional Unix systems treat all processes in the same way:
resources owned by the parent process are duplicated in the child
process. This approach makes process creation very slow and inefficient,
because it requires copying the entire address space of the parent
process. The child process rarely needs to read or modify all the
resources inherited from the parent; in many cases, it issues an
immediate execve( ) and wipes out the
address space that was so carefully copied.
Modern Unix kernels solve this problem by introducing three different mechanisms:
The Copy On Write technique allows both the parent and the child to read the same physical pages. Whenever either one tries to write on a physical page, the kernel copies its contents into a new physical page that is assigned to the writing process. The implementation of this technique in Linux is fully explained in Chapter 9.
Lightweight processes allow both the parent and the child to share many per-process kernel data structures, such as the paging tables (and therefore the entire User Mode address space), the open file tables, and the signal dispositions.
The vfork( ) system call
creates a process that shares the memory address space of its
parent. To prevent the parent from overwriting data needed by the
child, the parent's execution is blocked until the child exits or
executes a new program. We'll learn more about the vfork( ) system call in the following
section.
Lightweight processes are created in Linux by using a
function named clone( ), which uses
the following parameters:
fn
Specifies a function to be executed by the new process; when the function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.
arg
Points to data passed to the fn(
) function.
flags
Miscellaneous information. The low byte specifies the
signal number to be sent to the parent process when the child
terminates; the SIGCHLD
signal is generally selected. The remaining three bytes encode a
group of clone flags, which are shown in Table 3-8.
child_stack
Specifies the User Mode stack pointer to be assigned to
the esp register of the child
process. The invoking process (the parent) should always
allocate a new stack for the child.
tls
Specifies the address of a data structure that defines a
Thread Local Storage segment for the new lightweight process
(see the section "The Linux GDT" in
Chapter 2). Meaningful
only if the CLONE_SETTLS flag
is set.
ptid
Specifies the address of a User Mode variable of the
parent process that will hold the PID of the new lightweight
process. Meaningful only if the CLONE_PARENT_SETTID flag is
set.
ctid
Specifies the address of a User Mode variable of the new
lightweight process that will hold the PID of such process.
Meaningful only if the CLONE_CHILD_SETTID flag is set.
Table 3-8. Clone flags

| Flag name | Description |
|---|---|
| CLONE_VM | Shares the memory descriptor and all Page Tables (see Chapter 9). |
| CLONE_FS | Shares the table that identifies the root directory and the current working directory, as well as the value of the bitmask used to mask the initial file permissions of a new file (the so-called file umask). |
| CLONE_FILES | Shares the table that identifies the open files (see Chapter 12). |
| CLONE_SIGHAND | Shares the tables that identify the signal handlers and the blocked and pending signals (see Chapter 11). If this flag is true, the CLONE_VM flag must also be set. |
| CLONE_PTRACE | If traced, the parent wants the child to be traced too. Furthermore, the debugger may want to trace the child on its own; in this case, the kernel forces the flag to 1. |
| CLONE_VFORK | Set when the system call issued is a vfork( ) (see earlier in this chapter). |
| CLONE_PARENT | Sets the parent of the child (the parent and real_parent fields in the process descriptor) to the parent of the calling process. |
| CLONE_THREAD | Inserts the child into the same thread group of the parent, and forces the child to share the signal descriptor of the parent. The child's tgid and group_leader fields are set accordingly. |
| CLONE_NEWNS | Set if the clone needs its own namespace, that is, its own view of the mounted filesystems (see Chapter 12); it is not possible to specify both CLONE_NEWNS and CLONE_FS. |
| CLONE_SYSVSEM | Shares the System V IPC undoable semaphore operations (see the section "IPC Semaphores" in Chapter 19). |
| CLONE_SETTLS | Creates a new Thread Local Storage (TLS) segment for the lightweight process; the segment is described in the structure pointed to by the tls parameter. |
| CLONE_PARENT_SETTID | Writes the PID of the child into the User Mode variable of the parent pointed to by the ptid parameter. |
| CLONE_CHILD_CLEARTID | When set, the kernel sets up a mechanism to be triggered when the child process will exit or when it will start executing a new program. In these cases, the kernel will clear the User Mode variable pointed to by the ctid parameter and will awaken any process waiting for this event. |
| CLONE_DETACHED | A legacy flag ignored by the kernel. |
| CLONE_UNTRACED | Set by the kernel to override the value of the CLONE_PTRACE flag (used when the kernel forks kernel threads, as described later in this chapter). |
| CLONE_CHILD_SETTID | Writes the PID of the child into the User Mode variable of the child pointed to by the ctid parameter. |
| CLONE_STOPPED | Forces the child to start in the TASK_STOPPED state. |
clone( ) is actually a
wrapper function defined in the C library (see the section "POSIX APIs and System
Calls" in Chapter
10), which sets up the stack of the new lightweight process and
invokes a clone( ) system call
hidden to the programmer. The sys_clone(
) service routine that implements the clone( ) system call does not have the
fn and arg parameters. In fact, the wrapper
function saves the pointer fn into
the child's stack position corresponding to the return address of the
wrapper function itself; the pointer arg is saved on the child's stack right
below fn. When the wrapper function
terminates, the CPU fetches the return address from the stack and
executes the fn(arg)
function.
The traditional fork( )
system call is implemented by Linux as a clone( ) system call whose flags parameter specifies both a SIGCHLD signal and all the clone flags
cleared, and whose child_stack
parameter is the current parent stack pointer. Therefore, the parent
and child temporarily share the same User Mode stack. But thanks to
the Copy On Write mechanism, they usually get separate copies of the
User Mode stack as soon as one tries to change the stack.
The vfork( ) system call,
introduced in the previous section, is implemented by Linux as a
clone( ) system call whose flags parameter specifies both a SIGCHLD signal and the flags CLONE_VM and CLONE_VFORK, and whose child_stack parameter is equal to the
current parent stack pointer.
The do_fork( )
function, which handles the clone(
), fork( ), and
vfork( ) system calls, acts on
the following parameters:
clone_flags
Same as the flags
parameter of clone(
)
stack_start
Same as the child_stack parameter of clone( )
regs
Pointer to the values of the general purpose registers saved into the Kernel Mode stack when switching from User Mode to Kernel Mode (see the section "The do_IRQ( ) function" in Chapter 4)
stack_size
Unused (always set to 0)
parent_tidptr, child_tidptr
Same as the corresponding ptid and ctid parameters of clone()
do_fork( ) makes use of an
auxiliary function called copy_process(
) to set up the process descriptor and any other kernel
data structure required for child's execution. Here are the main
steps performed by do_fork(
):
Allocates a new PID for the child by looking in the
pidmap_array bitmap (see the
earlier section "Identifying a
Process").
Checks the ptrace field
of the parent (current->ptrace): if it is not
zero, the parent process is being traced by another process,
thus do_fork( ) checks
whether the debugger wants to trace the child on its own
(independently of the value of the CLONE_PTRACE flag specified by the
parent); in this case, if the child is not a kernel thread
(CLONE_UNTRACED flag
cleared), the function sets the CLONE_PTRACE flag.
Invokes copy_process()
to make a copy of the process descriptor. If all needed
resources are available, this function returns the address of
the task_struct descriptor
just created. This is the workhorse of the forking procedure,
and we will describe it right after do_fork( ).
If either the CLONE_STOPPED flag is set or the child
process must be traced, that is, the PT_PTRACED flag is set in p->ptrace, it sets the state of the
child to TASK_STOPPED and
adds a pending SIGSTOP signal
to it (see the section "The Role of Signals"
in Chapter 11). The
state of the child will remain TASK_STOPPED until another process
(presumably the tracing process or the parent) will revert its
state to TASK_RUNNING,
usually by means of a SIGCONT
signal.
If the CLONE_STOPPED
flag is not set, it invokes the wake_up_new_task( ) function, which
performs the following operations:
Adjusts the scheduling parameters of both the parent and the child (see "The Scheduling Algorithm" in Chapter 7).
If the child will run on the same CPU as the
parent,[*] and parent and child do not share the same set
of page tables (CLONE_VM
flag cleared), it then forces the child to run before the
parent by inserting it into the parent's runqueue right
before the parent. This simple step yields better
performance if the child flushes its address space and
executes a new program right after the forking. If we let
the parent run first, the Copy On Write mechanism would give
rise to a series of unnecessary page duplications.
Otherwise, if the child will not be run on the same
CPU as the parent, or if parent and child share the same set
of page tables (CLONE_VM
flag set), it inserts the child in the last position of the
parent's runqueue.
If the CLONE_STOPPED
flag is set, it puts the child in the TASK_STOPPED state.
If the parent process is being traced, it stores the PID
of the child in the ptrace_message field of current and invokes ptrace_notify( ), which essentially
stops the current process and sends a SIGCHLD signal to its parent. The
"grandparent" of the child is the debugger that is tracing the
parent; the SIGCHLD signal
notifies the debugger that current has forked a child, whose PID
can be retrieved by looking into the current->ptrace_message
field.
If the CLONE_VFORK flag
is specified, it inserts the parent process in a wait queue and
suspends it until the child releases its memory address space
(that is, until the child either terminates or executes a new
program).
Terminates by returning the PID of the child.
The copy_process( )
function sets up the process descriptor and any other kernel data
structure required for a child's execution. Its parameters are the
same as do_fork( ), plus the PID
of the child. Here is a description of its most significant
steps:
Checks whether the flags passed in the clone_flags parameter are compatible.
In particular, it returns an error code in the following
cases:
Both the flags CLONE_NEWNS and CLONE_FS are set.
The CLONE_THREAD
flag is set, but the CLONE_SIGHAND flag is cleared
(lightweight processes in the same thread group must share
signals).
The CLONE_SIGHAND
flag is set, but the CLONE_VM flag is cleared
(lightweight processes sharing the signal handlers must also
share the memory descriptor).
Performs any additional security checks by invoking
security_task_create( ) and,
later, security_task_alloc(
). The Linux kernel 2.6 offers hooks for security
extensions that enforce a security model stronger than the one
adopted by traditional Unix. See Chapter 20 for
details.
Invokes dup_task_struct(
) to get the process descriptor for the child. This
function performs the following actions:
Invokes _ _unlazy_fpu(
) on the current process to save, if necessary,
the contents of the FPU, MMX, and SSE/SSE2 registers in the
thread_info structure of
the parent. Later, dup_task_struct(
) will copy these values in the thread_info structure of the
child.
Executes the alloc_task_struct( ) macro to get
a process descriptor (task_struct structure) for the new
process, and stores its address in the tsk local variable.
Executes the alloc_thread_info macro to get a
free memory area to store the thread_info structure and the
Kernel Mode stack of the new process, and saves its address
in the ti local variable.
As explained in the earlier section "Identifying a
Process," the size of this memory area is either 8 KB
or 4 KB.
Copies the contents of the current's process descriptor into
the task_struct structure
pointed to by tsk, then
sets tsk->thread_info
to ti.
Copies the contents of the current's thread_info descriptor into the
structure pointed to by ti, then sets ti->task to tsk.
Sets the usage counter of the new process descriptor
(tsk->usage) to 2 to
specify that the process descriptor is in use and that the
corresponding process is alive (its state is not EXIT_ZOMBIE or EXIT_DEAD).
Returns the process descriptor pointer of the new
process (tsk).
Checks whether the value stored in current->signal->rlim[RLIMIT_NPROC].rlim_cur is smaller than or equal to
the current number of processes owned by the user. If so, an
error code is returned, unless the process has root privileges.
The function gets the current number of processes owned by the
user from a per-user data structure named user_struct. This data structure can
be found through a pointer in the user field of the process
descriptor.
Increases the usage counter of the user_struct structure (tsk->user->_ _count field) and
the counter of the processes owned by the user (tsk->user->processes).
Checks that the number of processes in the system (stored
in the nr_threads variable)
does not exceed the value of the max_threads variable. The default
value of this variable depends on the amount of RAM in the
system. The general rule is that the space taken by all thread_info descriptors and Kernel
Mode stacks cannot exceed 1/8 of the physical memory. However,
the system administrator may change this value by writing in the
/proc/sys/kernel/threads-max
file.
If the kernel functions implementing the execution domain and the executable format (see Chapter 20) of the new process are included in kernel modules, it increases their usage counters (see Appendix B).
Sets a few crucial fields related to the process state:
Initializes the big kernel lock counter tsk->lock_depth to -1 (see the section "The Big Kernel
Lock" in Chapter
5).
Initializes the tsk->did_exec field to 0: it
counts the number of execve(
) system calls issued by the process.
Updates some of the flags included in the tsk->flags field that have been
copied from the parent process: first clears the PF_SUPERPRIV flag, which indicates
whether the process has used any of its superuser
privileges, then sets the PF_FORKNOEXEC flag, which
indicates that the child has not yet issued an execve( ) system call.
Stores the PID of the new process in the tsk->pid field.
If the CLONE_PARENT_SETTID flag in the
clone_flags parameter is set,
it copies the child's PID into the User Mode variable addressed
by the parent_tidptr
parameter.
Initializes the list_head data structures and the spin
locks included in the child's process descriptor, and sets up
several other fields related to pending signals, timers, and
time statistics.
Invokes copy_semundo(
), copy_files( ),
copy_fs( ), copy_sighand( ), copy_signal( ), copy_mm( ), and copy_namespace( ) to create new data
structures and copy into them the values of the corresponding
parent process data structures, unless specified differently by
the clone_flags
parameter.
Invokes copy_thread( )
to initialize the Kernel Mode stack of the child process with
the values contained in the CPU registers when the clone( ) system call was issued (these
values have been saved in the Kernel Mode stack of the parent,
as described in Chapter
10). However, the function forces the value 0 into the
field corresponding to the eax register (this is the child's
return value of the fork() or
clone( ) system call). The
thread.esp field in the
descriptor of the child process is initialized with the base
address of the child's Kernel Mode stack, and the address of an
assembly language function (ret_from_fork( )) is stored in the
thread.eip field. If the
parent process makes use of an I/O Permission Bitmap, the child
gets a copy of such bitmap. Finally, if the CLONE_SETTLS flag is set, the child
gets the TLS segment specified by the User Mode data structure
pointed to by the tls
parameter of the clone( )
system call.[*]
If either CLONE_CHILD_SETTID or CLONE_CHILD_CLEARTID is set in the
clone_flags parameter, it
copies the value of the child_tidptr parameter in the tsk->set_child_tid or tsk->clear_child_tid field,
respectively. These flags specify that the value of the variable
pointed to by child_tidptr in
the User Mode address space of the child has to be changed,
although the actual write operations will be done later.
Turns off the TIF_SYSCALL_TRACE flag in the thread_info structure of the child, so
that the ret_from_fork( )
function will not notify the debugging process about the system
call termination (see the section "Entering and Exiting a
System Call" in Chapter 10). (The system
call tracing of the child is not disabled, because it is
controlled by the PTRACE_SYSCALL flag in tsk->ptrace.)
Initializes the tsk->exit_signal field with the
signal number encoded in the low bits of the clone_flags parameter, unless the
CLONE_THREAD flag is set, in
which case initializes the field to -1. As we'll see in the section "Process
Termination" later in this chapter, only the death of the
last member of a thread group (usually, the thread group leader)
causes a signal notifying the parent of the thread group
leader.
Invokes sched_fork( )
to complete the initialization of the scheduler data structure
of the new process. The function also sets the state of the new
process to TASK_RUNNING and
sets the preempt_count field
of the thread_info structure
to 1, thus disabling kernel preemption (see the section "Kernel Preemption"
in Chapter 5).
Moreover, in order to keep process scheduling fair, the function
shares the remaining timeslice of the parent between the parent
and the child (see "The scheduler_tick( )
Function" in Chapter
7).
Sets the cpu field in
the thread_info structure of
the new process to the number of the local CPU returned by
smp_processor_id( ).
Initializes the fields that specify the parenthood
relationships. In particular, if CLONE_PARENT or CLONE_THREAD are set, it initializes
tsk->real_parent and
tsk->parent to the value
in current->real_parent;
the parent of the child thus appears as the parent of the
current process. Otherwise, it sets the same fields to current.
If the child does not need to be traced (CLONE_PTRACE flag not set), it sets
the tsk->ptrace field to
0. This field stores a few flags used when a process is being
traced by another process. In such a way, even if the current
process is being traced, the child will not.
Executes the SET_LINKS
macro to insert the new process descriptor in the process
list.
If the child must be traced (PT_PTRACED flag in the tsk->ptrace field set), it sets
tsk->parent to current->parent and inserts the
child into the trace list of the debugger.
Invokes attach_pid( )
to insert the PID of the new process descriptor in the pidhash[PIDTYPE_PID] hash
table.
If the child is a thread group leader (flag CLONE_THREAD cleared):
Initializes tsk->tgid to tsk->pid.
Initializes tsk->group_leader to tsk.
Invokes attach_pid( ) three times to insert the child
in the PID hash tables of type PIDTYPE_TGID, PIDTYPE_PGID, and PIDTYPE_SID.
Otherwise, if the child belongs to the thread group of its
parent (CLONE_THREAD flag
set):
Initializes tsk->tgid to current->tgid.
Initializes tsk->group_leader to the value
in current->group_leader.
Invokes attach_pid(
) to insert the child in the PIDTYPE_TGID hash table (more
specifically, in the per-PID list of the current->group_leader
process).
A new process has now been added to the set of processes:
increases the value of the nr_threads variable.
Increases the total_forks variable to keep track of
the number of forked processes.
Terminates by returning the child's process descriptor
pointer (tsk).
Let's go back to what happens after do_fork() terminates. Now we have a
complete child process in the runnable state. But it isn't actually
running. It is up to the scheduler to decide when to give the CPU to
this child. At some future process switch, the scheduler bestows this
favor on the child process by loading a few CPU registers with the
values of the thread field of the
child's process descriptor. In particular, esp is loaded with thread.esp (that is, with the address of
child's Kernel Mode stack), and eip is loaded with the address of ret_from_fork( ). This assembly language
function invokes the schedule_tail(
) function (which in turn invokes the finish_task_switch( ) function to complete
the process switch; see the section "The schedule( )
Function" in Chapter
7), reloads all other registers with the values stored in the
stack, and forces the CPU back to User Mode. The new process then
starts its execution right at the end of the fork( ), vfork(
), or clone( ) system
call. The value returned by the system call is contained in eax: the value is 0 for the child and
equal to the PID for the child's parent. To understand how this is
done, look back at what copy_thread() does on the eax register of the child's process (step
13 of copy_process()).
The child process executes the same code as the parent, except
that the fork returns a 0 (see step 13 of copy_process( )). The developer of the
application can exploit this fact, in a manner familiar to Unix
programmers, by inserting a conditional statement in the program
based on the PID value that forces the child to behave differently
from the parent process.
Traditional Unix systems delegate some critical tasks to intermittently running processes, including flushing disk caches, swapping out unused pages, servicing network connections, and so on. Indeed, it is not efficient to perform these tasks in strict linear fashion; both their functions and the end user processes get better response if they are scheduled in the background. Because some of the system processes run only in Kernel Mode, modern operating systems delegate their functions to kernel threads , which are not encumbered with the unnecessary User Mode context. In Linux, kernel threads differ from regular processes in the following ways:
Kernel threads run only in Kernel Mode, while regular processes run alternatively in Kernel Mode and in User Mode.
Because kernel threads run only in Kernel Mode, they use
only linear addresses greater than PAGE_OFFSET. Regular processes, on the
other hand, use all four gigabytes of linear addresses, in either
User Mode or Kernel Mode.
The kernel_thread( )
function creates a new kernel thread. It receives as parameters the
address of the kernel function to be executed (fn), the argument to be passed to that
function (arg), and a set of
clone flags (flags). The function
essentially invokes do_fork( ) as
follows:
do_fork(flags|CLONE_VM|CLONE_UNTRACED, 0, pregs, 0, NULL, NULL);
The CLONE_VM flag avoids
the duplication of the page tables of the calling process: this
duplication would be a waste of time and memory, because the new
kernel thread will not access the User Mode address space anyway.
The CLONE_UNTRACED flag ensures
that no process will be able to trace the new kernel thread, even if
the calling process is being traced.
The pregs parameter passed
to do_fork( ) corresponds to the
address in the Kernel Mode stack where the copy_thread( ) function will find the
initial values of the CPU registers for the new thread. The kernel_thread( ) function builds up this
stack area so that:
The ebx and edx registers will be set by copy_thread() to the values of the
parameters fn and arg, respectively.
The eip register will
be set to the address of the following assembly language
fragment:
movl %edx,%eax
pushl %edx
call *%ebx
pushl %eax
call do_exit
Therefore, the new kernel thread starts by executing the
fn(arg) function. If this
function terminates, the kernel thread executes the _exit( ) system call passing to it the return value of
fn( ) (see the section "Destroying Processes"
later in this chapter).
The ancestor of all processes, called process 0, the idle process, or, for historical reasons, the swapper process, is a kernel thread created from scratch during the initialization phase of Linux (see Appendix A). This ancestor process uses the following statically allocated data structures (data structures for all other processes are dynamically allocated):
A process descriptor stored in the init_task variable, which is
initialized by the INIT_TASK
macro.
A thread_info
descriptor and a Kernel Mode stack stored in the init_thread_union variable and
initialized by the INIT_THREAD_INFO macro.
The following tables, which the process descriptor points to:
init_mm
init_fs
init_files
init_signals
init_sighand
The tables are initialized, respectively, by the following macros:
INIT_MM
INIT_FS
INIT_FILES
INIT_SIGNALS
INIT_SIGHAND
The master kernel Page Global Directory stored in swapper_pg_dir (see the section "Kernel Page Tables"
in Chapter 2).
The start_kernel( )
function initializes all the data structures needed by the kernel,
enables interrupts, and creates another kernel thread, named
process 1 (more commonly referred to as the
init process ):
kernel_thread(init, NULL, CLONE_FS|CLONE_SIGHAND);
The newly created kernel thread has PID 1 and shares all
per-process kernel data structures with process 0. When selected by
the scheduler, the init process starts
executing the init( )
function.
After having created the init process,
process 0 executes the cpu_idle(
) function, which essentially consists of repeatedly
executing the hlt assembly language instruction with the interrupts
enabled (see Chapter 4).
Process 0 is selected by the scheduler only when there are no other
processes in the TASK_RUNNING
state.
In multiprocessor systems there is a process 0 for each CPU.
Right after the power-on, the BIOS of the computer starts a single
CPU while disabling the others. The swapper process running on CPU 0
initializes the kernel data structures, then enables the other CPUs
and creates the additional swapper processes by
means of the copy_process( )
function passing to it the value 0 as the new PID. Moreover, the
kernel sets the cpu field of the
thread_info descriptor of each
forked process to the proper CPU index.
The kernel thread created by process 0 executes the
init( ) function, which in turn
completes the initialization of the kernel. Then init( ) invokes the execve( ) system call to load the executable program
init. As a result, the
init kernel thread becomes a regular process
having its own per-process kernel data structures (see Chapter 20). The
init process stays alive until the system is
shut down, because it creates and monitors the activity of all
processes that implement the outer layers of the operating
system.
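The handoff from kernel thread to regular process can be sketched like this (a pseudocode sketch; the real 2.6 init( ) tries several pathnames and also honors the init= boot parameter):

```c
/* sketch of the tail of the init( ) kernel function */
static int init(void *unused)
{
    /* ... complete the kernel initialization ... */
    execve("/sbin/init", argv_init, envp_init);
    /* if that fails, alternatives such as /bin/sh may be tried */
}
```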
Linux uses many other kernel threads. Some of them are created in the initialization phase and run until shutdown; others are created "on demand," when the kernel must execute a task that is better performed in its own execution context.
A few examples of kernel threads (besides process 0 and process 1) are:
keventd (also called events)
Executes the functions in the keventd_wq workqueue (see Chapter 4).
kapmd
Handles the events related to the Advanced Power Management (APM).
kswapd
Reclaims memory, as described in the section "Periodic Reclaiming" in Chapter 17.
Flushes "dirty" buffers to disk to reclaim memory, as described in the section "The pdflush Kernel Threads" in Chapter 15.
kblockd
Executes the functions in the kblockd_workqueue workqueue.
Essentially, it periodically activates the block device
drivers, as described in the section "Activating the Block
Device Driver" in Chapter 14.
ksoftirqd
Runs the tasklets (see section "Softirqs and Tasklets" in Chapter 4); there is one of these kernel threads for each CPU in the system.
[*] The parent process might be moved on to another CPU while the kernel forks the new process.
[*] A careful reader might wonder how copy_thread( ) gets the value of
the tls parameter of
clone( ), because
tls is not passed to
do_fork( ) and nested
functions. As we'll see in Chapter 10, the
parameters of the system calls are usually passed to the
kernel by copying their values into some CPU register; thus,
these values are saved in the Kernel Mode stack together
with the other registers. The copy_thread( ) function just looks
at the address saved in the Kernel Mode stack location
corresponding to the value of esi.
大多数进程“死亡”是指它们终止了它们应该运行的代码的执行。当这种情况发生时,必须通知内核,以便它可以释放进程所拥有的资源;这包括内存、打开的文件以及我们将在本书中遇到的任何其他零碎东西,例如信号量。
Most processes "die" in the sense that they terminate the execution of the code they were supposed to run. When this occurs, the kernel must be notified so that it can release the resources owned by the process; this includes memory, open files, and any other odds and ends that we will encounter in this book, such as semaphores.
The usual way for a process to terminate is to invoke the exit( ) library function, which releases the
resources allocated by the C library, executes each function registered
by the programmer, and ends up invoking a system call that evicts the
process from the system. The exit( )
library function may be inserted by the programmer
explicitly. Additionally, the C compiler always inserts an exit( ) function call right after the last
statement of the main( )
function.
Alternatively, the kernel may force a whole thread group to die. This typically occurs when a process in the group has received a signal that it cannot handle or ignore (see Chapter 11) or when an unrecoverable CPU exception has been raised in Kernel Mode while the kernel was running on behalf of the process (see Chapter 4).
In Linux 2.6 there are two system calls that terminate a User Mode application:
The exit_group( )
system call, which terminates a full thread group,
that is, a whole multithreaded application. The main kernel
function that implements this system call is called do_group_exit( ). This is the system
call that should be invoked by the exit() C library function.
The _exit( ) system call, which terminates a single process,
regardless of any other process in the thread group of the victim.
The main kernel function that implements this system call is
called do_exit( ). This is the
system call invoked, for instance, by the pthread_exit( ) function of the LinuxThreads library.
The do_group_exit(
) function kills all processes belonging to the thread
group of current. It receives as
a parameter the process termination code, which is either a value specified in the
exit_group( ) system call (normal
termination) or an error code supplied by the kernel (abnormal
termination). The function executes the following operations:
Checks whether the SIGNAL_GROUP_EXIT flag of the exiting
process is not zero, which means that the kernel already started
an exit procedure for this thread group. In this case, it
considers as exit code the value stored in current->signal->group_exit_code,
and jumps to step 4.
Otherwise, it sets the SIGNAL_GROUP_EXIT flag of the process
and stores the termination code in the current->signal->group_exit_code
field.
Invokes the zap_other_threads(
) function to kill the other processes in the thread
group of current, if any. In
order to do this, the function scans the per-PID list in the
PIDTYPE_TGID hash table
corresponding to current->tgid; for each process in
the list different from current, it sends a SIGKILL signal to it (see Chapter 11). As a result,
all such processes will eventually execute the do_exit( ) function, and thus they
will be killed.
Invokes the do_exit( )
function passing to it the process termination code. As we'll
see below, do_exit( ) kills
the process and never returns.
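The four steps above can be condensed into the following sketch (pseudocode; locking and corner cases omitted):

```c
/* sketch of do_group_exit( ) */
void do_group_exit(int exit_code)
{
    if (current->signal->flags & SIGNAL_GROUP_EXIT)   /* step 1 */
        exit_code = current->signal->group_exit_code;
    else {
        current->signal->flags |= SIGNAL_GROUP_EXIT;  /* step 2 */
        current->signal->group_exit_code = exit_code;
        zap_other_threads(current);                   /* step 3 */
    }
    do_exit(exit_code);                               /* step 4: never returns */
}
```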
All process terminations are handled by the do_exit( ) function, which removes most
references to the terminating process from kernel data structures.
The do_exit( ) function receives
as a parameter the process termination code and essentially executes
the following actions:
Sets the PF_EXITING
flag in the flags field of the
process descriptor to indicate that the process is being
eliminated.
Removes, if necessary, the process descriptor from a
dynamic timer queue via the del_timer_sync( ) function (see Chapter 6).
Detaches from the process descriptor the data structures
related to paging, semaphores, filesystem, open file
descriptors, namespaces, and I/O Permission Bitmap,
respectively, with the exit_mm(
), exit_sem( ),
_ _exit_files( ), _ _exit_fs(), exit_namespace( ), and exit_thread( ) functions. These
functions also remove each of these data structures if no other
processes are sharing them.
If the kernel functions implementing the execution domain and the executable format (see Chapter 20) of the process being killed are included in kernel modules, the function decreases their usage counters.
Sets the exit_code
field of the process descriptor to the process termination code.
This value is either the _exit(
) or exit_group( )
system call parameter (normal termination), or an error code
supplied by the kernel (abnormal termination).
Invokes the exit_notify(
) function to perform the following operations:
Updates the parenthood relationships of both the parent process and the child processes. All child processes created by the terminating process become children of another process in the same thread group, if any is running, or otherwise of the init process.
Checks whether the exit_signal process descriptor
field of the process being terminated is different from
-1, and whether the
process is the last member of its thread group (notice that
these conditions always hold for any normal process; see
step 16 in the description of copy_process( ) in the earlier
section "The
clone( ), fork( ), and vfork( ) System Calls"). In
this case, the function sends a signal (usually SIGCHLD) to the parent of the
process being terminated to notify the parent about a
child's death.
Otherwise, if the exit_signal field is equal to
-1 or the thread group
includes other processes, the function sends a SIGCHLD signal to the parent only
if the process is being traced (in this case the parent is
the debugger, which is thus informed of the death of the
lightweight process).
If the exit_signal
process descriptor field is equal to -1 and the process is not being
traced, it sets the exit_state field of the process
descriptor to EXIT_DEAD,
and invokes release_task(
) to reclaim the memory of the remaining process
data structures and to decrease the usage counter of the
process descriptor (see the following section). The usage
counter becomes equal to 1 (see step 3f in the copy_process( ) function), so that
the process descriptor itself is not released right
away.
Otherwise, if the exit_signal process descriptor
field is not equal to -1
or the process is being traced, it sets the exit_state field to EXIT_ZOMBIE. We'll see what
happens to zombie processes in the following section.
Sets the PF_DEAD
flag in the flags field
of the process descriptor (see the section "The schedule( )
Function" in Chapter 7).
Invokes the schedule( )
function (see Chapter
7) to select a new process to run. Because a process in
an EXIT_ZOMBIE state is
ignored by the scheduler, the process stops executing right
after the switch_to macro in
schedule( ) is invoked. As
we'll see in Chapter
7, the scheduler will check the PF_DEAD flag and will decrease the
usage counter in the descriptor of the zombie process being
replaced to denote the fact that the process is no longer
alive.
The Unix operating system allows a process to query the
kernel to obtain the PID of its parent process or the execution state
of any of its children. A process may, for instance, create a child
process to perform a specific task and then invoke some wait( )-like library function to check
whether the child has terminated. If the child has terminated, its
termination code will tell the parent process if the task has been
carried out successfully.
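This parent/child protocol can be exercised from User Mode. The following sketch (the helper name is ours) forks a child that terminates with a given code and lets the parent retrieve that code through waitpid( ):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates with the given code and return
   the termination code the parent obtains by waiting for it. */
int run_child_and_get_status(int code)
{
    int status;
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);              /* child: evict ourselves from the system */
    waitpid(pid, &status, 0);     /* parent: reap the child */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For instance, run_child_and_get_status(7) returns 7: the code passed to _exit( ) is exactly what the parent reads back.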
To comply with these design choices, Unix kernels are not
allowed to discard data included in a process descriptor field right
after the process terminates. They are allowed to do so only after the
parent process has issued a wait(
)-like system call that refers to the terminated process.
This is why the EXIT_ZOMBIE state
has been introduced: although the process is technically dead, its
descriptor must be saved until the parent process is notified.
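The zombie state can be observed directly on Linux through the /proc filesystem: if the parent delays its wait( ), the dead child shows up with state letter 'Z' in /proc/<pid>/stat. The following sketch (helper names are ours; Linux-specific) does exactly that:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/* Read the state letter (third field of /proc/<pid>/stat);
   'Z' corresponds to the EXIT_ZOMBIE state. */
static char proc_state(pid_t pid)
{
    char path[64], comm[64], st = '?';
    FILE *f;
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%*d %63s %c", comm, &st) != 2)
            st = '?';
        fclose(f);
    }
    (void)comm;
    return st;
}

char observe_zombie(void)
{
    pid_t pid = fork();
    char st = '?';
    int i;
    if (pid == 0)
        _exit(0);                     /* child terminates at once */
    for (i = 0; i < 1000 && st != 'Z'; i++) {
        struct timespec ts = { 0, 1000000 };  /* 1 ms */
        st = proc_state(pid);         /* parent has not waited yet */
        if (st != 'Z')
            nanosleep(&ts, NULL);
    }
    waitpid(pid, NULL, 0);            /* reap: the descriptor can now be freed */
    return st;
}
```

Between the child's termination and the parent's waitpid( ), the child is technically dead but still has a process descriptor, exactly as described above.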
What happens if parent processes terminate before their
children? In such a case, the system could be flooded with zombie
processes whose process descriptors would stay forever in RAM. As
mentioned earlier, this problem is solved by forcing all orphan
processes to become children of the init process.
In this way, the init process will destroy the
zombies while checking for the termination of one of its legitimate
children through a wait( )-like
system call.
The release_task( ) function
detaches the last data structures from the descriptor of a zombie
process; it is applied on a zombie process in two possible ways: by
the do_exit( ) function if the
parent is not interested in receiving signals from the child, or by
the wait4( ) or waitpid( )
system calls after a signal has been sent to the
parent. In the latter case, the function also will reclaim the memory
used by the process descriptor, while in the former case the memory
reclaiming will be done by the scheduler (see Chapter 7). This function executes
the following steps:
Decreases the number of processes belonging to the user
owner of the terminated process. This value is stored in the
user_struct structure mentioned
earlier in the chapter (see step 4 of copy_process( )).
If the process is being traced, the function removes it from
the debugger's ptrace_children
list and assigns the process back to its original parent.
Invokes _ _exit_signal()
to cancel any pending signal and to release the signal_struct descriptor of the process.
If the descriptor is no longer used by other lightweight
processes, the function also removes this data structure.
Moreover, the function invokes exit_itimers( ) to detach any POSIX
interval timer from the process.
Invokes _ _exit_sighand()
to get rid of the signal handlers.
Invokes _ _unhash_process(
), which in turn:
Decreases by 1 the nr_threads variable.
Invokes detach_pid( )
twice to remove the process descriptor from the pidhash hash tables of type PIDTYPE_PID and PIDTYPE_TGID.
If the process is a thread group leader, invokes again
detach_pid( ) twice to
remove the process descriptor from the PIDTYPE_PGID and PIDTYPE_SID hash tables.
Uses the REMOVE_LINKS
macro to unlink the process descriptor from the process
list.
If the process is not a thread group leader, the leader is a zombie, and the process is the last member of the thread group, the function sends a signal to the parent of the leader to notify it of the death of the process.
Invokes the sched_exit( )
function to adjust the timeslice of the parent process (this step
logically complements step 17 in the description of copy_process( )).
Invokes put_task_struct()
to decrease the process descriptor's usage counter; if the counter
becomes zero, the function drops any remaining reference to the
process:
Decreases the usage counter (_
_count field) of the user_struct data structure of the
user that owns the process (see step 5 of copy_process( )), and releases that
data structure if the usage counter becomes zero.
Releases the process descriptor and the memory area used
to contain the thread_info
descriptor and the Kernel Mode stack.
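The usage-counter protocol applied by put_task_struct() can be illustrated with a User Mode analogue (all names here are ours, not kernel code): an object shared by several holders is freed only when the last reference is dropped.

```c
#include <stdlib.h>

struct task_desc {
    int usage;
    /* ... per-process data ... */
};

struct task_desc *alloc_task(void)
{
    struct task_desc *t = malloc(sizeof *t);
    t->usage = 1;                     /* the creator holds one reference */
    return t;
}

struct task_desc *get_task(struct task_desc *t)
{
    t->usage++;                       /* a new holder appears */
    return t;
}

int put_task(struct task_desc *t)     /* returns 1 if the object was freed */
{
    if (--t->usage == 0) {
        free(t);                      /* last reference dropped: release it */
        return 1;
    }
    return 0;
}

int refcount_demo(void)
{
    struct task_desc *t = alloc_task();
    get_task(t);                      /* two holders now */
    int freed_first = put_task(t);    /* still referenced: not freed */
    int freed_second = put_task(t);   /* last put: object released */
    return freed_first == 0 && freed_second == 1;
}
```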
An interrupt is usually defined as an event that alters the sequence of instructions executed by a processor. Such events correspond to electrical signals generated by hardware circuits both inside and outside the CPU chip.
Interrupts are often divided into synchronous and asynchronous interrupts :
Synchronous interrupts are produced by the CPU control unit while executing instructions and are called synchronous because the control unit issues them only after terminating the execution of an instruction.
Asynchronous interrupts are generated by other hardware devices at arbitrary times with respect to the CPU clock signals.
Intel microprocessor manuals designate synchronous and asynchronous interrupts as exceptions and interrupts, respectively. We'll adopt this classification, although we'll occasionally use the term "interrupt signal" to designate both types together (synchronous as well as asynchronous).
Interrupts are issued by interval timers and I/O devices; for instance, the arrival of a keystroke from a user sets off an interrupt.
Exceptions, on the other hand, are caused either by programming
errors or by anomalous conditions that must be handled by the kernel. In
the first case, the kernel handles the exception by delivering to the
current process one of the signals familiar to every Unix programmer. In
the second case, the kernel performs all the steps needed to recover from
the anomalous condition, such as a Page Fault or a request—via an assembly
language instruction such as int
or sysenter —for a kernel service.
We start by describing in the next section the motivation for introducing such signals. We then show how the well-known IRQs (Interrupt ReQuests) issued by I/O devices give rise to interrupts, and we detail how 80×86 processors handle interrupts and exceptions at the hardware level. Then we illustrate, in the section "Initializing the Interrupt Descriptor Table," how Linux initializes all the data structures required by the 80×86 interrupt architecture. The remaining three sections describe how Linux handles interrupt signals at the software level.
One word of caution before moving on: in this chapter, we cover only "classic" interrupts common to all PCs; we do not cover the nonstandard interrupts of some architectures.
As the name suggests, interrupt signals provide a way to
divert the processor to code outside the normal flow of control. When an
interrupt signal arrives, the CPU must stop what it's currently doing
and switch to a new activity; it does this by saving the current value
of the program counter (i.e., the content of the eip and cs
registers) in the Kernel Mode stack and by placing an address related to
the interrupt type into the program counter.
There are some things in this chapter that will remind you of the context switch described in the previous chapter, carried out when a kernel substitutes one process for another. But there is a key difference between interrupt handling and process switching: the code executed by an interrupt or by an exception handler is not a process. Rather, it is a kernel control path that runs at the expense of the same process that was running when the interrupt occurred (see the later section "Nested Execution of Exception and Interrupt Handlers"). As a kernel control path, the interrupt handler is lighter than a process (it has less context and requires less time to set up or tear down).
Interrupt handling is one of the most sensitive tasks performed by the kernel, because it must satisfy the following constraints:
Interrupts can come anytime, when the kernel may want to finish something else it was trying to do. The kernel's goal is therefore to get the interrupt out of the way as soon as possible and defer as much processing as it can. For instance, suppose a block of data has arrived on a network line. When the hardware interrupts the kernel, it could simply mark the presence of data, give the processor back to whatever was running before, and do the rest of the processing later (such as moving the data into a buffer where its recipient process can find it, and then restarting the process). The activities that the kernel needs to perform in response to an interrupt are thus divided into a critical urgent part that the kernel executes right away and a deferrable part that is left for later.
Because interrupts can come anytime, the kernel might be handling one of them while another one (of a different type) occurs. This should be allowed as much as possible, because it keeps the I/O devices busy (see the later section "Nested Execution of Exception and Interrupt Handlers"). As a result, the interrupt handlers must be coded so that the corresponding kernel control paths can be executed in a nested manner. When the last kernel control path terminates, the kernel must be able to resume execution of the interrupted process or switch to another process if the interrupt signal has caused a rescheduling activity.
Although the kernel may accept a new interrupt signal while handling a previous one, some critical regions exist inside the kernel code where interrupts must be disabled. Such critical regions must be limited as much as possible because, according to the previous requirement, the kernel, and particularly the interrupt handlers, should run most of the time with the interrupts enabled.
The Intel documentation classifies interrupts and exceptions as follows:
Interrupts:
All Interrupt Requests (IRQs) issued by I/O devices give rise to maskable interrupts . A maskable interrupt can be in two states: masked or unmasked; a masked interrupt is ignored by the control unit as long as it remains masked.
Only a few critical events (such as hardware failures) give rise to nonmaskable interrupts . Nonmaskable interrupts are always recognized by the CPU.
Exceptions:
Generated when the CPU detects an anomalous condition
while executing an instruction. These are further divided into
three groups, depending on the value of the eip register that is saved on the
Kernel Mode stack when the CPU control unit raises the
exception.
Faults
Can generally be corrected; once corrected, the program
is allowed to restart with no loss of continuity. The saved
value of eip is the address
of the instruction that caused the fault, and hence that
instruction can be resumed when the exception handler
terminates. As we'll see in the section "Page Fault Exception
Handler" in Chapter
9, resuming the same instruction is necessary whenever
the handler is able to correct the anomalous condition that
caused the exception.
Traps
Reported immediately following the execution of the
trapping instruction; after the kernel returns control to the
program, it is allowed to continue its execution with no loss
of continuity. The saved value of eip is the address of the
instruction that should be executed after the one that caused
the trap. A trap is triggered only when there is no need to
reexecute the instruction that terminated. The main use of
traps is for debugging purposes. The role of the interrupt
signal in this case is to notify the debugger that a specific
instruction has been executed (for instance, a breakpoint has
been reached within a program). Once the user has examined the
data provided by the debugger, she may ask that execution of
the debugged program resume, starting from the next
instruction.
Aborts
A serious error occurred; the control unit is in
trouble, and it may be unable to store in the eip register the precise location of
the instruction causing the exception. Aborts are used to
report severe errors, such as hardware failures and invalid or
inconsistent values in system tables. The interrupt signal
sent by the control unit is an emergency signal used to switch
control to the corresponding abort exception handler. This
handler has no choice but to force the affected process to
terminate.
Programmed exceptions
Occur at the request of the programmer. They are
triggered by int or int3
instructions; the into (check for overflow) and bound (check on address bound) instructions also give
rise to a programmed exception when the condition they are
checking is not true. Programmed exceptions are handled by the
control unit as traps; they are often called
software interrupts . Such exceptions have two common uses: to
implement system calls and to notify a debugger of a specific
event (see Chapter
10).
Each interrupt or exception is identified by a number ranging from 0 to 255; Intel calls this 8-bit unsigned number a vector. The vectors of nonmaskable interrupts and exceptions are fixed, while those of maskable interrupts can be altered by programming the Interrupt Controller (see the next section).
Each hardware device controller capable of issuing interrupt requests usually has a single output line designated as the Interrupt ReQuest (IRQ) line.[*] All existing IRQ lines are connected to the input pins of a hardware circuit called the Programmable Interrupt Controller, which performs the following actions:
Monitors the IRQ lines, checking for raised signals. If two or more IRQ lines are raised, selects the one having the lower pin number.
If a raised signal occurs on an IRQ line:
Converts the raised signal received into a corresponding vector.
Stores the vector in an Interrupt Controller I/O port, thus allowing the CPU to read it via the data bus.
Sends a raised signal to the processor INTR pin—that is, issues an interrupt.
Waits until the CPU acknowledges the interrupt signal by writing into one of the Programmable Interrupt Controllers (PIC) I/O ports; when this occurs, clears the INTR line.
Goes back to step 1.
The IRQ lines are sequentially numbered starting from 0; therefore, the first IRQ line is usually denoted as IRQ 0. Intel's default vector associated with IRQ n is n+32. As mentioned before, the mapping between IRQs and vectors can be modified by issuing suitable I/O instructions to the Interrupt Controller ports.
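The default mapping can be written down as a one-line helper (the function name is ours):

```c
/* Vectors 0-31 are reserved by Intel for exceptions and nonmaskable
   interrupts, so by default IRQ n is mapped to vector n + 32. */
int irq_to_vector(int irq)
{
    return irq + 32;
}
```

For instance, IRQ 0 (the interval timer) gets vector 32, IRQ 1 (the keyboard) gets vector 33, and IRQ 14 (the first IDE channel) gets vector 46.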
Each IRQ line can be selectively disabled. Thus, the PIC can be programmed to disable IRQs. That is, the PIC can be told to stop issuing interrupts that refer to a given IRQ line, or to resume issuing them. Disabled interrupts are not lost; the PIC sends them to the CPU as soon as they are enabled again. This feature is used by most interrupt handlers, because it allows them to process IRQs of the same type serially.
Selective enabling/disabling of IRQs is not the same as global
masking/unmasking of maskable interrupts. When the IF flag of the eflags register is clear, each maskable
interrupt issued by the PIC is temporarily ignored by the CPU. The
cli and sti assembly language instructions, respectively, clear and
set that flag.
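Inside the kernel, the flag is usually not manipulated with bare cli/sti but through macros that also preserve the previous state; a typical critical region looks like this (a kernel-only sketch, not standalone code):

```c
unsigned long flags;

local_irq_save(flags);     /* save eflags, then clear IF (a cli) */
/* ... access data shared with interrupt handlers ... */
local_irq_restore(flags);  /* restore eflags: IF regains its saved value */
```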
传统的 PIC 是通过“级联”连接两个 8259A 型外部芯片来实现的。每个芯片最多可以处理八个不同的 IRQ 输入线。由于从 PIC 的 INT 输出线连接到主 PIC 的 IRQ 2 引脚,因此可用 IRQ 线的数量限制为 15。
Traditional PICs are implemented by connecting "in cascade" two 8259A-style external chips. Each chip can handle up to eight different IRQ input lines. Because the INT output line of the slave PIC is connected to the IRQ 2 pin of the master PIC, the number of available IRQ lines is limited to 15.
The previous description refers to PICs designed for uniprocessor systems. If the system includes a single CPU, the output line of the master PIC can be connected in a straightforward way to the INTR pin of the CPU. However, if the system includes two or more CPUs, this approach is no longer valid, and more sophisticated PICs are needed.
Being able to deliver interrupts to each CPU in the system is crucial for fully exploiting the parallelism of the SMP architecture. For that reason, starting with the Pentium III, Intel introduced a new component designated as the I/O Advanced Programmable Interrupt Controller (I/O APIC). This chip is the advanced version of the old 8259A Programmable Interrupt Controller; to support old operating systems, recent motherboards include both types of chip. Moreover, all current 80×86 microprocessors include a local APIC. Each local APIC has 32-bit registers, an internal clock, a local timer device, and two additional IRQ lines, LINT 0 and LINT 1, reserved for local APIC interrupts. All local APICs are connected to an external I/O APIC, giving rise to a multi-APIC system.
Figure 4-1 illustrates in a schematic way the structure of a multi-APIC system. An APIC bus connects the "frontend" I/O APIC to the local APICs. The IRQ lines coming from the devices are connected to the I/O APIC, which therefore acts as a router with respect to the local APICs. In the motherboards of the Pentium III and earlier processors, the APIC bus was a serial three-line bus; starting with the Pentium 4, the APIC bus is implemented by means of the system bus. However, because the APIC bus and its messages are invisible to software, we won't give further details.
The I/O APIC consists of a set of 24 IRQ lines, a 24-entry Interrupt Redirection Table, programmable registers, and a message unit for sending and receiving APIC messages over the APIC bus. Unlike IRQ pins of the 8259A, interrupt priority is not related to pin number: each entry in the Redirection Table can be individually programmed to indicate the interrupt vector and priority, the destination processor, and how the processor is selected. The information in the Redirection Table is used to translate each external IRQ signal into a message to one or more local APIC units via the APIC bus.
Interrupt requests coming from external hardware devices can be distributed among the available CPUs in two ways:
The IRQ signal is delivered to the local APICs listed in the corresponding Redirection Table entry. The interrupt is delivered to one specific CPU, to a subset of CPUs, or to all CPUs at once (broadcast mode).
The IRQ signal is delivered to the local APIC of the processor that is executing the process with the lowest priority.
Every local APIC has a programmable task priority register (TPR), which is used to compute the priority of the currently running process. Intel expects this register to be modified in an operating system kernel by each process switch.
If two or more CPUs share the lowest priority, the load is distributed between them using a technique called arbitration. Each CPU is assigned a different arbitration priority ranging from 0 (lowest) to 15 (highest) in the arbitration priority register of the local APIC.
Every time an interrupt is delivered to a CPU, its corresponding arbitration priority is automatically set to 0, while the arbitration priority of any other CPU is increased. When the arbitration priority register becomes greater than 15, it is set to the previous arbitration priority of the winning CPU increased by 1. Therefore, interrupts are distributed in a round-robin fashion among CPUs with the same task priority.[*]
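The round-robin behavior can be illustrated with a small simulation (a simplified model of the scheme just described, not the actual bus arbitration logic): each delivery picks the CPU with the highest arbitration priority, resets it to 0, and bumps the others, wrapping as described above.

```python
def deliver(prios):
    """Deliver one interrupt; return the index of the winning CPU.

    prios is a list of per-CPU arbitration priorities (0-15).
    """
    winner = max(range(len(prios)), key=lambda cpu: prios[cpu])
    prev = prios[winner]
    for cpu in range(len(prios)):
        if cpu == winner:
            prios[cpu] = 0          # winner's arbitration priority drops to 0
        else:
            prios[cpu] += 1         # every other CPU moves up
            if prios[cpu] > 15:     # wrap: previous winner's priority + 1
                prios[cpu] = prev + 1
    return winner

prios = [3, 2, 1, 0]                # four CPUs sharing the same task priority
print([deliver(prios) for _ in range(8)])   # round-robin: 0,1,2,3,0,1,2,3
```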
Besides distributing interrupts among processors, the multi-APIC system allows CPUs to generate interprocessor interrupts . When a CPU wishes to send an interrupt to another CPU, it stores the interrupt vector and the identifier of the target's local APIC in the Interrupt Command Register (ICR) of its own local APIC. A message is then sent via the APIC bus to the target's local APIC, which therefore issues a corresponding interrupt to its own CPU.
Interprocessor interrupts (in short, IPIs) are a crucial component of the SMP architecture. They are actively used by Linux to exchange messages among CPUs (see later in this chapter).
Many of the current uniprocessor systems include an I/O APIC chip, which may be configured in two distinct ways:
As a standard 8259A-style external PIC connected to the CPU. The local APIC is disabled and the two LINT 0 and LINT 1 local IRQ lines are configured, respectively, as the INTR and NMI pins.
As a standard external I/O APIC. The local APIC is enabled, and all external interrupts are received through the I/O APIC.
The 80×86 microprocessors issue roughly 20 different exceptions.[*] The kernel must provide a dedicated exception handler for each exception type. For some exceptions, the CPU control unit also generates a hardware error code and pushes it on the Kernel Mode stack before starting the exception handler.
The following list gives the vector, the name, the type, and a brief description of the exceptions found in 80×86 processors. Additional information may be found in the Intel technical documentation.
Raised when a program issues an integer division by 0.
Raised when the TF flag
of eflags is set (quite useful to implement
single-step execution of a debugged program) or when the address of an
instruction or operand falls within the range of an active debug
register (see the section "Hardware Context"
in Chapter 3).
Reserved for nonmaskable interrupts (those that use the NMI pin).
Caused by an int3
(breakpoint) instruction (usually inserted by a
debugger).
An into (check for overflow) instruction has been
executed while the OF
(overflow) flag of eflags is
set.
A bound (check on address bound) instruction is executed
with the operand outside of the valid address bounds.
The CPU execution unit has detected an invalid opcode (the part of the machine instruction that determines the operation performed).
An ESCAPE, MMX, or SSE/SSE2 instruction has been executed
with the TS flag of cr0 set (see the section "Saving and Loading the
FPU, MMX, and XMM Registers" in Chapter 3).
Normally, when the CPU detects an exception while trying to call the handler for a prior exception, the two exceptions can be handled serially. In a few cases, however, the processor cannot handle them serially, so it raises this exception.
Problems with the external mathematical coprocessor (applies only to old 80386 microprocessors).
The CPU has attempted a context switch to a process having an invalid Task State Segment.
A reference was made to a segment not present in memory
(one in which the Segment-Present flag of the Segment
Descriptor was cleared).
The instruction attempted to exceed the stack segment
limit, or the segment identified by ss is not present in memory.
One of the protection rules in the protected mode of the 80×86 has been violated.
The addressed page is not present in memory, the corresponding Page Table entry is null, or a violation of the paging protection mechanism has occurred.
The floating-point unit integrated into the CPU chip has signaled an error condition, such as numeric overflow or division by 0.[*]
The address of an operand is not correctly aligned (for instance, the address of a long integer is not a multiple of 4).
A machine-check mechanism has detected a CPU or bus error.
The SSE or SSE2 unit integrated in the CPU chip has signaled an error condition on a floating-point operation.
The values from 20 to 31 are reserved by Intel for future development. As illustrated in Table 4-1, each exception is handled by a specific exception handler (see the section "Exception Handling" later in this chapter), which usually sends a Unix signal to the process that caused the exception.
Table 4-1. Signals sent by the exception handlers
| # | Exception | Exception handler | Signal |
|---|---|---|---|
| 0 | Divide error | | |
| 1 | Debug | | |
| 2 | NMI | | None |
| 3 | Breakpoint | | |
| 4 | Overflow | | |
| 5 | Bounds check | | |
| 6 | Invalid opcode | | |
| 7 | Device not available | | None |
| 8 | Double fault | | None |
| 9 | Coprocessor segment overrun | | |
| 10 | Invalid TSS | | |
| 11 | Segment not present | | |
| 12 | Stack segment fault | | |
| 13 | General protection | | |
| 14 | Page Fault | | |
| 15 | Intel-reserved | None | None |
| 16 | Floating-point error | | |
| 17 | Alignment check | | |
| 18 | Machine check | | None |
| 19 | SIMD floating point | | |
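To give a concrete idea of the mapping, a few well-known cases can be expressed as a lookup. This is an illustrative fragment only, listing signals that Linux is widely documented to send for these vectors (the "Divide error" case, for instance, is mentioned again at the end of this section's excerpt):

```python
# Illustrative only: the Unix signal a few of the exception handlers
# send to the faulting process, keyed by exception vector.
SIGNAL_FOR_EXCEPTION = {
    0:  "SIGFPE",    # Divide error
    3:  "SIGTRAP",   # Breakpoint
    6:  "SIGILL",    # Invalid opcode
    14: "SIGSEGV",   # Page Fault
}

print(SIGNAL_FOR_EXCEPTION[0])   # SIGFPE
```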
A system table called Interrupt Descriptor Table (IDT ) associates each interrupt or exception vector with the address of the corresponding interrupt or exception handler. The IDT must be properly initialized before the kernel enables interrupts.
The IDT format is similar to that of the GDT and the LDTs examined in Chapter 2. Each entry corresponds to an interrupt or an exception vector and consists of an 8-byte descriptor. Thus, a maximum of 256 × 8 = 2048 bytes are required to store the IDT.
The idtr CPU register allows the IDT to be located anywhere in
memory: it specifies both the IDT base physical address and its limit
(maximum length). It must be initialized before enabling interrupts by
using the lidt assembly language instruction.
The IDT may include three types of descriptors; Figure 4-2 illustrates the
meaning of the 64 bits included in each of them. In particular, the
value of the Type field encoded in
the bits 40–43 identifies the descriptor type.
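For instance, the Type field can be pulled out of a descriptor's 64 bits with a simple shift and mask. This is a sketch; the type values 0x5, 0xE, and 0xF below are the Intel encodings of the 32-bit task, interrupt, and trap gates:

```python
def descriptor_type(desc):
    """Extract the 4-bit Type field stored in bits 40-43 of a descriptor."""
    return (desc >> 40) & 0xF

# Intel encodings for the three gate types that may appear in the IDT.
TASK_GATE, INTERRUPT_GATE, TRAP_GATE = 0x5, 0xE, 0xF

# Build a bare descriptor carrying only an interrupt-gate Type field.
desc = INTERRUPT_GATE << 40
print(hex(descriptor_type(desc)))   # 0xe
```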
The descriptors are:
Includes the TSS selector of the process that must replace the current one when an interrupt signal occurs.
Includes the Segment Selector and the offset inside the
segment of an interrupt or exception handler. While transferring
control to the proper segment, the processor clears the IF flag, thus disabling further
maskable interrupts.
Similar to an interrupt gate, except that while
transferring control to the proper segment, the processor does
not modify the IF
flag.
As we'll see in the later section "Interrupt, Trap, and System Gates," Linux uses interrupt gates to handle interrupts and trap gates to handle exceptions.[*]
We now describe how the CPU control unit handles interrupts and exceptions. We assume that the kernel has been initialized, and thus the CPU is operating in Protected Mode.
After executing an instruction, the cs and eip pair of registers contain the logical
address of the next instruction to be executed. Before dealing with
that instruction, the control unit checks whether an interrupt or an
exception occurred while the control unit executed the previous
instruction. If one occurred, the control unit does the
following:
Determines the vector i (0 ≤ i ≤ 255) associated with the interrupt or the exception.
Reads the i th entry of the IDT referred to by the idtr register (we assume in the following description that the entry contains an interrupt or a trap gate).
Gets the base address of the GDT from the gdtr register and looks in the GDT to read the Segment
Descriptor identified by the selector in the IDT entry. This
descriptor specifies the base address of the segment that includes
the interrupt or exception handler.
Makes sure the interrupt was issued by an authorized source.
First, it compares the Current Privilege Level (CPL), which is
stored in the two least significant bits of the cs register, with the Descriptor
Privilege Level (DPL ) of the Segment Descriptor included in the GDT.
Raises a "General protection " exception if the CPL is lower than the DPL,
because the interrupt handler cannot have a lower privilege than
the program that caused the interrupt. For programmed exceptions,
makes a further security check: compares the CPL with the DPL of
the gate descriptor included in the IDT and raises a "General
protection" exception if the DPL is lower than the CPL. This last
check makes it possible to prevent access by user applications to
specific trap or interrupt gates.
Checks whether a change of privilege level is taking place — that is, if CPL is different from the selected Segment Descriptor's DPL. If so, the control unit must start using the stack that is associated with the new privilege level. It does this by performing the following steps:
Reads the tr
register to access the TSS segment of the
running process.
Loads the ss and
esp registers with the
proper values for the stack segment and stack pointer
associated with the new privilege level. These values are
found in the TSS (see the section "Task State
Segment" in Chapter
3).
In the new stack, it saves the previous values of
ss and esp, which define the logical
address of the stack associated with the old privilege
level.
If a fault has occurred, it loads cs and eip with the logical address of the
instruction that caused the exception so that it can be executed
again.
If the exception carries a hardware error code, it saves it on the stack.
Loads cs and eip, respectively, with the Segment
Selector and the Offset fields of the Gate Descriptor stored in
the i th entry of the IDT. These values
define the logical address of the first instruction of the
interrupt or exception handler.
The last step performed by the control unit is equivalent to a jump to the interrupt or exception handler. In other words, the instruction processed by the control unit after dealing with the interrupt signal is the first instruction of the selected handler.
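The privilege checks in the sequence above can be summarized in a few lines of Python (a simplified model; remember that on the 80×86 a numerically smaller privilege level means more privilege, so "CPL lower than DPL" in the text corresponds to a numerically greater CPL):

```python
def gate_access_ok(cpl, handler_seg_dpl, gate_dpl, programmed):
    """Model of the two "General protection" checks described above.

    Privilege levels are the numeric x86 values: 0 (most privileged,
    Kernel Mode) through 3 (least privileged, User Mode).
    """
    # The handler cannot be less privileged than the interrupted program.
    if handler_seg_dpl > cpl:
        return False                      # "General protection"
    # For programmed exceptions (int instructions), the gate DPL must
    # also be at least as unprivileged as the caller's CPL.
    if programmed and gate_dpl < cpl:
        return False                      # "General protection"
    return True

# A User Mode int $0x80 goes through a gate whose DPL is 3 ...
print(gate_access_ok(3, 0, 3, programmed=True))    # True
# ... but a User Mode int aimed at a DPL-0 gate is rejected.
print(gate_access_ok(3, 0, 0, programmed=True))    # False
```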
After the interrupt or exception is processed, the corresponding
handler must relinquish control to the interrupted process by issuing
the iret instruction, which forces the control unit to:
Load the cs, eip, and eflags registers with the values saved
on the stack. If a hardware error code has been pushed in the
stack on top of the eip
contents, it must be popped before executing iret.
Check whether the CPL of the handler is equal to the value
contained in the two least significant bits of cs (this means the interrupted process
was running at the same privilege level as the handler). If so,
iret concludes execution;
otherwise, go to the next step.
Load the ss and esp registers from the stack and return
to the stack associated with the old privilege level.
Examine the contents of the ds, es, fs, and gs segment registers; if any of them
contains a selector that refers to a Segment Descriptor whose DPL
value is lower than CPL, clear the corresponding segment register.
The control unit does this to forbid User Mode programs that run
with a CPL equal to 3 from using segment registers previously used
by kernel routines (with a DPL equal to 0). If these registers
were not cleared, malicious User Mode programs could exploit them
in order to access the kernel address space.
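The last of these steps can be sketched as a tiny model (illustrative only; each segment register is represented as a hypothetical (selector, descriptor DPL) pair, and clearing a register is modeled as storing a null selector):

```python
def sanitize_segment_registers(cpl, seg_regs):
    """Simplified model of iret's last step: clear any data segment
    register whose descriptor is more privileged (numerically lower
    DPL) than the privilege level being returned to."""
    return {name: 0 if dpl < cpl else sel
            for name, (sel, dpl) in seg_regs.items()}

# Returning to User Mode (CPL 3): a kernel data selector left in ds
# (descriptor DPL 0) is cleared, while a user selector in es survives.
regs = {"ds": (0x10, 0), "es": (0x23, 3)}
print(sanitize_segment_registers(3, regs))
```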
[*] More sophisticated devices use several IRQ lines. For instance, a PCI card can use up to four IRQ lines.
[*] The Pentium 4 local APIC doesn't have an arbitration priority register; the arbitration mechanism is hidden in the bus arbitration circuitry. The Intel manuals state that if the operating system kernel does not regularly update the task priority registers, performance may be suboptimal because interrupts might always be serviced by the same CPU.
[*] The 80×86 microprocessors also generate this exception when performing a signed division whose result cannot be stored as a signed integer (for instance, a division between -2,147,483,648 and -1).
[*] The "Double fault " exception, which denotes a type of kernel misbehavior, is the only exception handled by means of a task gate (see the section "Exception Handling" later in this chapter.).
Every interrupt or exception gives rise to a kernel control path, that is, a separate sequence of instructions executed in Kernel Mode on behalf of the current process. For instance, when an I/O device raises an interrupt, the first instructions of the corresponding kernel control path are those that save the contents of the CPU registers in the Kernel Mode stack, while the last are those that restore the contents of the registers.
Kernel control paths may be arbitrarily nested; an interrupt handler may be interrupted by another interrupt handler, thus giving rise to a nested execution of kernel control paths , as shown in Figure 4-3. As a result, the last instructions of a kernel control path that is taking care of an interrupt do not always put the current process back into User Mode: if the level of nesting is greater than 1, these instructions will put into execution the kernel control path that was interrupted last, and the CPU will continue to run in Kernel Mode.
The price to pay for allowing nested kernel control paths is that an interrupt handler must never block, that is, no process switch can take place while an interrupt handler is running. In fact, all the data needed to resume a nested kernel control path is stored in the Kernel Mode stack, which is tightly bound to the current process.
Assuming that the kernel is bug free, most exceptions can occur only while the CPU is in User Mode. Indeed, they are either caused by programming errors or triggered by debuggers. However, the "Page Fault " exception may occur in Kernel Mode. This happens when the process attempts to address a page that belongs to its address space but is not currently in RAM. While handling such an exception, the kernel may suspend the current process and replace it with another one until the requested page is available. The kernel control path that handles the "Page Fault" exception resumes execution as soon as the process gets the processor again.
Because the "Page Fault" exception handler never gives rise to further exceptions, at most two kernel control paths associated with exceptions (the first one caused by a system call invocation, the second one caused by a Page Fault) may be stacked, one on top of the other.
In contrast to exceptions, interrupts issued by I/O devices do not refer to data structures specific to the current process, although the kernel control paths that handle them run on behalf of that process. As a matter of fact, it is impossible to predict which process will be running when a given interrupt occurs.
An interrupt handler may preempt both other interrupt handlers and exception handlers. Conversely, an exception handler never preempts an interrupt handler. The only exception that can be triggered in Kernel Mode is "Page Fault," which we just described. But interrupt handlers never perform operations that can induce page faults, and thus, potentially, a process switch.
Linux interleaves kernel control paths for two major reasons:
To improve the throughput of programmable interrupt controllers and device controllers. Assume that a device controller issues a signal on an IRQ line: the PIC transforms it into an external interrupt, and then both the PIC and the device controller remain blocked until the PIC receives an acknowledgment from the CPU. Thanks to kernel control path interleaving, the kernel is able to send the acknowledgment even when it is handling a previous interrupt.
To implement an interrupt model without priority levels. Because each interrupt handler may be deferred by another one, there is no need to establish predefined priorities among hardware devices. This simplifies the kernel code and improves its portability.
On multiprocessor systems, several kernel control paths may execute concurrently. Moreover, a kernel control path associated with an exception may start executing on a CPU and, due to a process switch, migrate to another CPU.
Now that we understand what the 80×86 microprocessors do with interrupts and exceptions at the hardware level, we can move on to describe how the Interrupt Descriptor Table is initialized.
Remember that before the kernel enables the interrupts, it must
load the initial address of the IDT table into the idtr register and initialize all the entries of that table.
This activity is done while initializing the system (see Appendix A).
The int instruction allows a User Mode process to issue an
interrupt signal that has an arbitrary vector ranging from 0 to 255.
Therefore, initialization of the IDT must be done carefully, to block
illegal interrupts and exceptions simulated by User Mode processes via
int instructions. This can be
achieved by setting the DPL field of the particular Interrupt or Trap
Gate Descriptor to 0. If the process attempts to issue one of these
interrupt signals, the control unit checks the CPL value against the DPL
field and issues a "General protection " exception.
In a few cases, however, a User Mode process must be able to issue a programmed exception. To allow this, it is sufficient to set the DPL field of the corresponding Interrupt or Trap Gate Descriptors to 3 — that is, as high as possible.
Let's now see how Linux implements this strategy.
As mentioned in the earlier section "Interrupt Descriptor Table," Intel provides three types of interrupt descriptors : Task, Interrupt, and Trap Gate Descriptors. Linux uses a slightly different breakdown and terminology from Intel when classifying the interrupt descriptors included in the Interrupt Descriptor Table:
An Intel interrupt gate that cannot be accessed by a User Mode process (the gate's DPL field is equal to 0). All Linux interrupt handlers are activated by means of interrupt gates , and all are restricted to Kernel Mode.
An Intel trap gate that can be accessed by a User Mode process (the gate's DPL field is equal to 3). The three Linux exception handlers associated with the vectors 4, 5, and 128 are activated by means of system gates, so the three assembly language instructions into, bound, and int $0x80 can be issued in User Mode.
An Intel interrupt gate that can be accessed by a User
Mode process (the gate's DPL field is equal to 3). The exception
handler associated with the vector 3 is activated by means of a
system interrupt gate, so the assembly language instruction
int3 can be issued in User Mode.
An Intel trap gate that cannot be accessed by a User Mode process (the gate's DPL field is equal to 0). Most Linux exception handlers are activated by means of trap gates .
An Intel task gate that cannot be accessed by a User Mode process (the gate's DPL field is equal to 0). The Linux handler for the "Double fault " exception is activated by means of a task gate.
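The five Linux gate flavors just listed can be condensed into a small table (illustrative; each entry gives the underlying Intel gate type and the DPL value described above):

```python
# Linux name -> (Intel gate type, DPL), per the classification above.
LINUX_GATES = {
    "interrupt gate":        ("interrupt", 0),  # all interrupt handlers
    "system gate":           ("trap",      3),  # vectors 4, 5, and 128
    "system interrupt gate": ("interrupt", 3),  # vector 3 (int3)
    "trap gate":             ("trap",      0),  # most exception handlers
    "task gate":             ("task",      0),  # "Double fault"
}

print(LINUX_GATES["system gate"])
```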
The following architecture-dependent functions are used to insert gates in the IDT:
set_intr_gate(n,addr)
Inserts an interrupt gate in the n th
IDT entry. The Segment Selector inside the gate is set to the
kernel code's Segment Selector. The Offset field is set to
addr, which is the address of
the interrupt handler. The DPL field is set to 0.
set_system_gate(n,addr)
Inserts a trap gate in the n th IDT
entry. The Segment Selector inside the gate is set to the kernel
code's Segment Selector. The Offset field is set to addr, which is the address of the
exception handler. The DPL field is set to 3.
set_system_intr_gate(n,addr)
Inserts an interrupt gate in the n th
IDT entry. The Segment Selector inside the gate is set to the
kernel code's Segment Selector. The Offset field is set to
addr, which is the address of
the exception handler. The DPL field is set to 3.
set_trap_gate(n,addr)set_trap_gate(n,addr)与上一个函数类似,只是 DPL 字段设置为 0。
Similar to the previous function, except the DPL field is set to 0.
set_task_gate(n,gdt)
Inserts a task gate in the n th IDT entry. The Segment Selector inside the gate stores the index in the GDT of the TSS containing the function to be activated. The Offset field is set to 0, while the DPL field is set to 3.
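What these functions build can be sketched by packing an 8-byte interrupt-gate descriptor by hand (a simplified model, not the kernel's implementation; note how the Present bit, a DPL of 0, and the interrupt-gate Type 0xE together yield the 0x8e byte that appears as $0x8e00 in the setup_idt( ) code shown later in this section):

```python
def make_interrupt_gate(selector, offset, dpl=0):
    """Pack a 64-bit 80x86 interrupt-gate descriptor (simplified sketch)."""
    desc  = offset & 0xFFFF              # offset bits 15..0
    desc |= selector << 16               # Segment Selector
    desc |= 0xE << 40                    # Type = 1110: 32-bit interrupt gate
    desc |= dpl << 45                    # DPL, bits 45-46
    desc |= 1 << 47                      # Present bit
    desc |= (offset & 0xFFFF0000) << 32  # offset bits 31..16
    return desc

# Hypothetical kernel-code selector and handler address, for illustration.
gate = make_interrupt_gate(selector=0x10, offset=0xC01000AB, dpl=0)
# Byte 5 of the descriptor (P, DPL, Type) is 0x8e for a DPL-0 interrupt gate.
print(hex((gate >> 40) & 0xFF))   # 0x8e
```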
The IDT is initialized and used by the BIOS routines while the computer still operates in Real Mode. Once Linux takes over, however, the IDT is moved to another area of RAM and initialized a second time, because Linux does not use any BIOS routine (see Appendix A).
The IDT is stored in the idt_table table, which includes 256 entries.
The 6-byte idt_descr variable
stores both the size of the IDT and its address and is used in the
system initialization phase when the kernel sets up the idtr register with the lidt assembly language instruction.[*]
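The 6-byte operand expected by lidt can be modeled with a struct pack: a 16-bit limit followed by a 32-bit linear base address (an illustrative sketch, not the kernel's actual declaration; the base address below is made up):

```python
import struct

def pack_idt_descr(base, entries=256):
    """6-byte lidt operand: 16-bit limit, then 32-bit linear base address."""
    limit = entries * 8 - 1        # 256 gates x 8 bytes each, minus one
    return struct.pack("<HI", limit, base)

descr = pack_idt_descr(0xC0100000)       # hypothetical IDT base address
print(len(descr), hex(struct.unpack("<HI", descr)[0]))   # 6 0x7ff
```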
During kernel initialization, the setup_idt( ) assembly language function
starts by filling all 256 entries of idt_table with the same interrupt gate,
which refers to the ignore_int( )
interrupt handler:
setup_idt:
lea ignore_int, %edx
movl $(_ _KERNEL_CS << 16), %eax
movw %dx, %ax /* selector = 0x0010 = cs */
movw $0x8e00, %dx /* interrupt gate, dpl=0, present */
lea idt_table, %edi
mov $256, %ecx
rp_sidt:
movl %eax, (%edi)
movl %edx, 4(%edi)
addl $8, %edi
dec %ecx
jne rp_sidt
retignore_int( )汇编语言的中断处理程序可以被视为执行以下操作的空处理程序:
The ignore_int( ) interrupt
handler, which is in assembly language, may be viewed as a null
handler that executes the following actions:
Saves the contents of some registers in the stack.

Invokes the printk( ) function to print an "Unknown interrupt" system message.

Restores the register contents from the stack.

Executes an iret instruction to resume the interrupted program.
The ignore_int( ) handler
should never be executed. The occurrence of "Unknown interrupt"
messages on the console or in the log files denotes either a hardware
problem (an I/O device is issuing unforeseen interrupts) or a kernel
problem (an interrupt or exception is not being handled
properly).
Following this preliminary initialization, the kernel makes a second pass in the IDT to replace some of the null handlers with meaningful trap and interrupt handlers. Once this is done, the IDT includes a specialized interrupt, trap, or system gate for each different exception issued by the control unit and for each IRQ recognized by the interrupt controller.
The next two sections illustrate in detail how this is done for exceptions and interrupts.
[*] Some old Pentium models have the notorious "f00f" bug, which
allows User Mode programs to freeze the system. When executing on
such CPUs, Linux uses a workaround based on initializing the
idtr register with a fix-mapped
read-only linear address pointing to the actual IDT (see the
section "Fix-Mapped
Linear Addresses" in Chapter 2).
Most exceptions issued by the CPU are interpreted by Linux
as error conditions. When one of them occurs, the kernel sends a signal
to the process that caused the exception to notify it of an anomalous
condition. If, for instance, a process performs a division by zero, the
CPU raises a "Divide error " exception, and the corresponding exception handler
sends a SIGFPE signal to the current
process, which then takes the necessary steps to recover or (if no
signal handler is set for that signal) abort.
There are a couple of cases, however, where Linux exploits CPU
exceptions to manage hardware resources more efficiently. A first case
is already described in the section "Saving and Loading the FPU, MMX,
and XMM Registers" in Chapter
3. The "Device not available " exception is used together with the TS flag of the cr0 register to force the kernel to load the floating point
registers of the CPU with new values. A second case involves the "Page
Fault " exception, which is used to defer allocating new page
frames to the process until the last possible moment. The corresponding
handler is complex because the exception may, or may not, denote an
error condition (see the section "Page Fault Exception Handler"
in Chapter 9).
Exception handlers have a standard structure consisting of three steps:
Save the contents of most registers in the Kernel Mode stack (this part is coded in assembly language).
Handle the exception by means of a high-level C function.
Exit from the handler by means of the ret_from_exception( ) function.
To take advantage of exceptions, the IDT must be properly
initialized with an exception handler function for each recognized
exception. It is the job of the trap_init(
) function to insert the final values—the functions that
handle the exceptions—into all IDT entries that refer to nonmaskable
interrupts and exceptions. This is accomplished through the set_trap_gate( ), set_intr_gate( ), set_system_gate( ), set_system_intr_gate( ), and set_task_gate( ) functions:
set_trap_gate(0,&divide_error);
set_trap_gate(1,&debug);
set_intr_gate(2,&nmi);
set_system_intr_gate(3,&int3);
set_system_gate(4,&overflow);
set_system_gate(5,&bounds);
set_trap_gate(6,&invalid_op);
set_trap_gate(7,&device_not_available);
set_task_gate(8,31);
set_trap_gate(9,&coprocessor_segment_overrun);
set_trap_gate(10,&invalid_TSS);
set_trap_gate(11,&segment_not_present);
set_trap_gate(12,&stack_segment);
set_trap_gate(13,&general_protection);
set_intr_gate(14,&page_fault);
set_trap_gate(16,&coprocessor_error);
set_trap_gate(17,&alignment_check);
set_trap_gate(18,&machine_check);
set_trap_gate(19,&simd_coprocessor_error);
set_system_gate(128,&system_call);
The "Double fault" exception is handled by means of a task gate
instead of a trap or system gate, because it denotes a serious kernel
misbehavior. Thus, the exception handler that tries to print out the
register values does not trust the current value of the esp register. When such an exception occurs,
the CPU fetches the Task Gate Descriptor stored in the entry at index 8
of the IDT. This descriptor points to the special TSS segment descriptor
stored in the 32nd entry of the GDT. Next,
the CPU loads the eip and esp registers with the values stored in the
corresponding TSS segment. As a result, the processor executes the
doublefault_fn() exception handler on
its own private stack.
Now we will look at what a typical exception handler does once it is invoked. Our description of exception handling will be a bit sketchy for lack of space. In particular we won't be able to cover:
The signal codes (see Table 11-8 in Chapter 11) sent by some handlers to the User Mode processes.
Exceptions that occur when the kernel is operating in MS-DOS emulation mode (vm86 mode), which must be dealt with differently.
Let's use handler_name to denote the name of a generic
exception handler. (The actual names of all the exception handlers
appear on the list of macros in the previous section.) Each exception
handler starts with the following assembly language
instructions:
handler_name:
pushl $0 /* only for some exceptions */
pushl $do_handler_name
jmp error_code
If the control unit is not supposed to automatically insert a
hardware error code on the stack when the exception occurs, the
corresponding assembly language fragment includes a pushl $0 instruction to pad the stack with a
null value. Then the address of the high-level C function is pushed on
the stack; its name consists of the exception handler name prefixed by
do_.
The assembly language fragment labeled as error_code is the same for all exception
handlers except the one for the "Device not available " exception (see the section "Saving and Loading the FPU, MMX,
and XMM Registers" in Chapter 3). The code performs the
following steps:
Saves the registers that might be used by the high-level C function on the stack.
Issues a cld instruction to clear the direction flag DF of eflags , thus making sure that autoincreases on the
edi and esi registers will be used with string
instructions.[*]
Copies the hardware error code saved in the stack at
location esp+36 in edx. Stores the value -1 in the same
stack location. As we'll see in the section "Reexecution of System
Calls" in Chapter
11, this value is used to separate 0x80 exceptions from other
exceptions.
Loads edi with the
address of the high-level do_handler_name( ) C function saved in
the stack at location esp+32;
writes the contents of es in
that stack location.
Loads in the eax register
the current top location of the Kernel Mode stack. This address
identifies the memory cell containing the last register value
saved in step 1.
Loads the user data Segment Selector into the ds and es registers.
Invokes the high-level C function whose address is now
stored in edi.
The invoked function receives its arguments from the eax and edx registers rather than from the stack. We
have already run into a function that gets its arguments from the CPU
registers: the _ _switch_to( )
function, discussed in the section "Performing the Process
Switch" in Chapter
3.
As already explained, the names of the C functions that
implement exception handlers always consist of the prefix do_ followed by the handler name. Most of
these functions invoke the do_trap() function to store the hardware
error code and the exception vector in the process descriptor of
current, and then send a suitable
signal to that process:
current->thread.error_code = error_code;
current->thread.trap_no = vector;
force_sig(sig_number, current);
The current process takes care of the signal right after the termination of the exception handler. The signal will be handled either in User Mode by the process's own signal handler (if it exists) or in Kernel Mode. In the latter case, the kernel usually kills the process (see Chapter 11). The signals sent by the exception handlers are listed in Table 4-1.
The exception handler always checks whether the exception
occurred in User Mode or in Kernel Mode and, in the latter case,
whether it was due to an invalid argument passed to a system call.
We'll describe in the section "Dynamic Address Checking: The
Fix-up Code" in Chapter
10 how the kernel defends itself against invalid arguments
passed to system calls. Any other exception raised in Kernel Mode is
due to a kernel bug. In this case, the exception handler knows the
kernel is misbehaving. In order to avoid data corruption on the hard
disks, the handler invokes the die(
) function, which prints the contents of all CPU registers
on the console (this dump is called kernel oops
) and terminates the current process by calling do_exit( ) (see "Process Termination" in
Chapter 3).
When the C function that implements the exception handling
terminates, the code performs a jmp
instruction to the ret_from_exception(
) function. This function is described in the later section
"Returning from Interrupts
and Exceptions."
As we explained earlier, most exceptions are handled simply by sending a Unix signal to the process that caused the exception. The action to be taken is thus deferred until the process receives the signal; as a result, the kernel is able to process the exception quickly.
This approach does not hold for interrupts, because they frequently arrive long after the process to which they are related (for instance, a process that requested a data transfer) has been suspended and a completely unrelated process is running. So it would make no sense to send a Unix signal to the current process.
Interrupt handling depends on the type of interrupt. For our purposes, we'll distinguish three main classes of interrupts:
An I/O device requires attention; the corresponding interrupt handler must query the device to determine the proper course of action. We cover this type of interrupt in the later section "I/O Interrupt Handling."
Some timer, either a local APIC timer or an external timer, has issued an interrupt; this kind of interrupt tells the kernel that a fixed-time interval has elapsed. These interrupts are handled mostly as I/O interrupts; we discuss the peculiar characteristics of timer interrupts in Chapter 6.
A CPU issued an interrupt to another CPU of a multiprocessor system. We cover such interrupts in the later section "Interprocessor Interrupt Handling."
In general, an I/O interrupt handler must be flexible enough to service several devices at the same time. In the PCI bus architecture, for instance, several devices may share the same IRQ line. This means that the interrupt vector alone does not tell the whole story. In the example shown in Table 4-3, the same vector 43 is assigned to the USB port and to the sound card. However, some hardware devices found in older PC architectures (such as ISA) do not reliably operate if their IRQ line is shared with other devices.
Interrupt handler flexibility is achieved in two distinct ways, as discussed in the following list.
The interrupt handler executes several interrupt service routines (ISRs). Each ISR is a function related to a single device sharing the IRQ line. Because it is not possible to know in advance which particular device issued the IRQ, each ISR is executed to verify whether its device needs attention; if so, the ISR performs all the operations that need to be executed when the device raises an interrupt.
An IRQ line is associated with a device driver at the last possible moment; for instance, the IRQ line of the floppy device is allocated only when a user accesses the floppy disk device. In this way, the same IRQ vector may be used by several hardware devices even if they cannot share the IRQ line; of course, the hardware devices cannot be used at the same time. (See the discussion at the end of this section.)
Not all actions to be performed when an interrupt occurs have
the same urgency. In fact, the interrupt handler itself is not a
suitable place for all kind of actions. Long noncritical operations
should be deferred, because while an interrupt handler is running, the
signals on the corresponding IRQ line are temporarily ignored. Most
important, the process on behalf of which an interrupt handler is
executed must always stay in the TASK_RUNNING state, or a system freeze can
occur. Therefore, interrupt handlers cannot perform any blocking
procedure such as an I/O disk operation. Linux divides the actions to
be performed following an interrupt into three classes:
Actions such as acknowledging an interrupt to the PIC, reprogramming the PIC or the device controller, or updating data structures accessed by both the device and the processor. These can be executed quickly and are critical, because they must be performed as soon as possible. Critical actions are executed within the interrupt handler immediately, with maskable interrupts disabled.
Actions such as updating data structures that are accessed only by the processor (for instance, reading the scan code after a keyboard key has been pushed). These actions can also finish quickly, so they are executed by the interrupt handler immediately, with the interrupts enabled.
Actions such as copying a buffer's contents into the address space of a process (for instance, sending the keyboard line buffer to the terminal handler process). These may be delayed for a long time interval without affecting the kernel operations; the interested process will just keep waiting for the data. Noncritical deferrable actions are performed by means of separate functions that are discussed in the later section "Softirqs and Tasklets."
Regardless of the kind of circuit that caused the interrupt, all I/O interrupt handlers perform the same four basic actions:
Save the IRQ value and the register's contents on the Kernel Mode stack.
Send an acknowledgment to the PIC that is servicing the IRQ line, thus allowing it to issue further interrupts.
Execute the interrupt service routines (ISRs) associated with all the devices that share the IRQ.
Terminate by jumping to the ret_from_intr( ) address.
Several descriptors are needed to represent both the state of the IRQ lines and the functions to be executed when an interrupt occurs. Figure 4-4 represents in a schematic way the hardware circuits and the software functions used to handle an interrupt. These functions are discussed in the following sections.
As illustrated in Table 4-2, physical IRQs may be assigned any vector in the range 32-238. However, Linux uses vector 128 to implement system calls.
The IBM-compatible PC architecture requires that some devices be statically connected to specific IRQ lines. In particular:
The interval timer device must be connected to the IRQ 0 line (see Chapter 6).
The slave 8259A PIC must be connected to the IRQ 2 line (although more advanced PICs are now being used, Linux still supports 8259A-style PICs).
The external mathematical coprocessor must be connected to the IRQ 13 line (although recent 80 × 86 processors no longer use such a device, Linux continues to support the hardy 80386 model).
In general, an I/O device can be connected to a limited number of IRQ lines. (As a matter of fact, when playing with an old PC where IRQ sharing is not possible, you might not succeed in installing a new card because of IRQ conflicts with other already present hardware devices.)
Table 4-2. Interrupt vectors in Linux
| Vector range | Use |
|---|---|
| 0–19 | Nonmaskable interrupts and exceptions |
| 20–31 | Intel-reserved |
| 32–127 | External interrupts (IRQs) |
| 128 | Programmed exception for system calls (see Chapter 10) |
| 129–238 | External interrupts (IRQs) |
| 239 | Local APIC timer interrupt (see Chapter 6) |
| 240 | Local APIC thermal interrupt (introduced in the Pentium 4 models) |
| 241–250 | Reserved by Linux for future use |
| 251–253 | Interprocessor interrupts (see the section "Interprocessor Interrupt Handling" later in this chapter) |
| 254 | Local APIC error interrupt (generated when the local APIC detects an erroneous condition) |
| 255 | Local APIC spurious interrupt (generated if the CPU masks an interrupt while the hardware device raises it) |
There are three ways to select a line for an IRQ-configurable device:
By setting hardware jumpers (only on very old device cards).
By a utility program shipped with the device and executed when installing it. Such a program may either ask the user to select an available IRQ number or probe the system to determine an available number by itself.
By a hardware protocol executed at system startup.
Peripheral devices declare which interrupt lines they are ready
to use; the final values are then negotiated to reduce conflicts
as much as possible. Once this is done, each interrupt handler
can read the assigned IRQ by using a function that accesses some
I/O ports of the device. For instance, drivers for devices that
comply with the Peripheral Component Interconnect (PCI) standard
use a group of functions such as pci_read_config_byte( ) to access the
device configuration space.
Table 4-3 shows a fairly arbitrary arrangement of devices and IRQs, such as those that might be found on one particular PC.
Table 4-3. An example of IRQ assignment to I/O devices
| IRQ | INT | Hardware device |
|---|---|---|
| 0 | 32 | Timer |
| 1 | 33 | Keyboard |
| 2 | 34 | PIC cascading |
| 3 | 35 | Second serial port |
| 4 | 36 | First serial port |
| 6 | 38 | Floppy disk |
| 8 | 40 | System clock |
| 10 | 42 | Network interface |
| 11 | 43 | USB port, sound card |
| 12 | 44 | PS/2 mouse |
| 13 | 45 | Mathematical coprocessor |
| 14 | 46 | EIDE disk controller's first chain |
| 15 | 47 | EIDE disk controller's second chain |
The kernel must discover which I/O device corresponds to the IRQ number before enabling interrupts. Otherwise, for example, how could the kernel handle a signal from a SCSI disk without knowing which vector corresponds to the device? The correspondence is established while initializing each device driver (see Chapter 13).
As always, when discussing complicated operations involving state transitions, it helps to understand first where key data is stored. Thus, this section explains the data structures that support interrupt handling and how they are laid out in various descriptors. Figure 4-5 illustrates schematically the relationships between the main descriptors that represent the state of the IRQ lines. (The figure does not illustrate the data structures needed to handle softirqs and tasklets; they are discussed later in this chapter.)
Every interrupt vector has its own irq_desc_t descriptor, whose fields are
listed in Table
4-4. All such descriptors are grouped together in the
irq_desc array.
Table 4-4. The irq_desc_t descriptor
| Field | Description |
|---|---|
| handler | Points to the PIC object (hw_irq_controller descriptor) that services the IRQ line. |
| handler_data | Pointer to data used by the PIC methods. |
| action | Identifies the interrupt service routines to be invoked when the IRQ occurs. The field points to the first element of the list of irqaction descriptors associated with the IRQ. |
| status | A set of flags describing the IRQ line status (see Table 4-5). |
| depth | Shows 0 if the IRQ line is enabled and a positive value if it has been disabled at least once. |
| irq_count | Counter of interrupt occurrences on the IRQ line (for diagnostic use only). |
| irqs_unhandled | Counter of unhandled interrupt occurrences on the IRQ line (for diagnostic use only). |
| lock | A spin lock used to serialize the accesses to the IRQ descriptor and to the PIC (see Chapter 5). |
An interrupt is unexpected if it is not handled by the kernel, that is, either
if there is no ISR associated with the IRQ line, or if no ISR
associated with the line recognizes the interrupt as raised by its
own hardware device. Usually the kernel checks the number of
unexpected interrupts received on an IRQ line, so as to disable the
line in case a faulty hardware device keeps raising an interrupt
over and over. Because the IRQ line can be shared among several
devices, the kernel does not disable the line as soon as it detects
a single unhandled interrupt. Rather, the kernel stores in the irq_count and irqs_unhandled fields of the irq_desc_t descriptor the total number of
interrupts and the number of unexpected interrupts, respectively;
when the 100,000th interrupt is raised,
the kernel disables the line if the number of unhandled interrupts
is above 99,900 (that is, if less than 101 interrupts over the last
100,000 received are expected interrupts from hardware devices
sharing the line).
The status of an IRQ line is described by the flags listed in Table 4-5.
Table 4-5. Flags describing the IRQ line status
| Flag name | Description |
|---|---|
| IRQ_INPROGRESS | A handler for the IRQ is being executed. |
| IRQ_DISABLED | The IRQ line has been deliberately disabled by a device driver. |
| IRQ_PENDING | An IRQ has occurred on the line; its occurrence has been acknowledged to the PIC, but it has not yet been serviced by the kernel. |
| IRQ_REPLAY | The IRQ line has been disabled but the previous IRQ occurrence has not yet been acknowledged to the PIC. |
| IRQ_AUTODETECT | The kernel is using the IRQ line while performing a hardware device probe. |
| IRQ_WAITING | The kernel is using the IRQ line while performing a hardware device probe; moreover, the corresponding interrupt has not been raised. |
| IRQ_LEVEL | Not used on the 80 × 86 architecture. |
| IRQ_MASKED | Not used. |
| IRQ_PER_CPU | Not used on the 80 × 86 architecture. |
The depth field and the
IRQ_DISABLED flag of the irq_desc_t descriptor specify whether the
IRQ line is enabled or disabled. Every time the disable_irq( ) or disable_irq_nosync( ) function is invoked,
the depth field is increased; if
depth is equal to 0, the function
disables the IRQ line and sets its IRQ_DISABLED flag.[*] Conversely, each invocation of the enable_irq( ) function decreases the
field; if depth becomes 0, the
function enables the IRQ line and clears its IRQ_DISABLED flag.
During system initialization, the init_IRQ( ) function sets the status field of each IRQ main descriptor
to IRQ _DISABLED. Moreover,
init_IRQ( ) updates the IDT by
replacing the interrupt gates set up by setup_idt( ) (see the section "Preliminary Initialization of
the IDT," earlier in this chapter) with new ones. This is
accomplished through the following statements:
for (i = 0; i < NR_IRQS; i++)
if (i+32 != 128)
set_intr_gate(i+32,interrupt[i]);
This code looks in the interrupt array to find the interrupt
handler addresses that it uses to set up the interrupt
gates . Each entry n of the interrupt array stores the address of the
interrupt handler for IRQ n (see the later
section "Saving the
registers for the interrupt handler"). Notice that the
interrupt gate corresponding to vector 128 is left untouched,
because it is used for the system call's programmed
exception.
In addition to the 8259A chip that was mentioned near the
beginning of this chapter, Linux supports several other PIC circuits
such as the SMP IO-APIC, Intel PIIX4's internal 8259 PIC, and SGI's
Visual Workstation Cobalt (IO-)APIC. To handle all such devices in a
uniform way, Linux uses a PIC object,
consisting of the PIC name and seven PIC standard methods. The
advantage of this object-oriented approach is that drivers need not
to be aware of the kind of PIC installed in the system. Each
driver-visible interrupt source is transparently wired to the
appropriate controller. The data structure that defines a PIC object
is called hw_interrupt_type (also
called hw_irq_controller).
For the sake of concreteness, let's assume that our computer
is a uniprocessor with two 8259A PICs, which provide 16 standard
IRQs. In this case, the handler
field in each of the 16 irq_desc_t descriptors points to the
i8259A_irq_type variable, which
describes the 8259A PIC. This variable is initialized as
follows:
struct hw_interrupt_type i8259A_irq_type = {
.typename = "XT-PIC",
.startup = startup_8259A_irq,
.shutdown = shutdown_8259A_irq,
.enable = enable_8259A_irq,
.disable = disable_8259A_irq,
.ack = mask_and_ack_8259A,
.end = end_8259A_irq,
.set_affinity = NULL
};
The first field in this structure, "XT-PIC", is the PIC name. Next come the
pointers to six different functions used to program the PIC. The
first two functions start up and shut down an IRQ line of the chip,
respectively. But in the case of the 8259A chip, these functions
coincide with the third and fourth functions, which enable and
disable the line. The mask_and_ack_8259A(
) function acknowledges the IRQ received by sending the
proper bytes to the 8259A I/O ports. The end_8259A_irq( ) function is invoked when
the interrupt handler for the IRQ line terminates. The last set_affinity method is set to NULL: it is used in multiprocessor systems
to declare the "affinity" of CPUs for specified IRQs — that is,
which CPUs are enabled to handle specific IRQs.
As described earlier, multiple devices can share a single IRQ.
Therefore, the kernel maintains irqaction descriptors (see Figure 4-5 earlier in this
chapter), each of which refers to a specific hardware device and a
specific interrupt. The fields included in such a descriptor are shown
in Table 4-6, and
the flags are shown in Table 4-7.
Table 4-6. Fields of the irqaction descriptor
| Field name | Description |
|---|---|
| handler | Points to the interrupt service routine for an I/O device. This is the key field that allows many devices to share the same IRQ. |
| flags | This field includes a few flags that describe the relationships between the IRQ line and the I/O device (see Table 4-7). |
| mask | Not used. |
| name | The name of the I/O device (shown when listing the serviced IRQs by reading the /proc/interrupts file). |
| dev_id | A private field for the I/O device. Typically, it identifies the I/O device itself (for instance, it could be equal to its major and minor numbers; see the section "Device Files" in Chapter 13), or it points to the device driver's data. |
| next | Points to the next element of a list of irqaction descriptors. |
| irq | IRQ line. |
| dir | Points to the descriptor of the /proc/irq/n directory associated with the IRQ n. |
Table 4-7. Flags of the irqaction descriptor
| Flag name | Description |
|---|---|
| SA_INTERRUPT | The handler must execute with interrupts disabled. |
| SA_SHIRQ | The device permits its IRQ line to be shared with other devices. |
| SA_SAMPLE_RANDOM | The device may be considered a source of events that occur randomly; it can thus be used by the kernel random number generator. (Users can access this feature by taking random numbers from the /dev/random and /dev/urandom device files.) |
Finally, the irq_stat array
includes NR_CPUS entries, one for
every possible CPU in the system. Each entry of type irq_cpustat_t includes a few counters and
flags used by the kernel to keep track of what each CPU is currently
doing (see Table
4-8).
Table 4-8. Fields of the irq_cpustat_t structure
| Field name | Description |
|---|---|
| __softirq_pending | Set of flags denoting the pending softirqs (see the section "Softirqs" later in this chapter) |
| idle_timestamp | Time when the CPU became idle (significant only if the CPU is currently idle) |
| __nmi_count | Number of occurrences of NMI interrupts |
| apic_timer_irqs | Number of occurrences of local APIC timer interrupts (see Chapter 6) |
Linux sticks to the Symmetric Multiprocessing model (SMP ); this means, essentially, that the kernel should not have any bias toward one CPU with respect to the others. As a consequence, the kernel tries to distribute the IRQ signals coming from the hardware devices in a round-robin fashion among all the CPUs. Therefore, all the CPUs should spend approximately the same fraction of their execution time servicing I/O interrupts.
In the earlier section "The Advanced Programmable Interrupt Controller (APIC)," we said that the multi-APIC system has sophisticated mechanisms to dynamically distribute the IRQ signals among the CPUs.
During system bootstrap, the booting CPU executes the setup_IO_APIC_irqs( ) function to
initialize the I/O APIC chip. The 24 entries of the Interrupt
Redirection Table of the chip are filled, so that all IRQ signals
from the I/O hardware devices can be routed to each CPU in the
system according to the "lowest priority" scheme (see the earlier
section "IRQs and
Interrupts"). During system bootstrap, moreover, all CPUs
execute the setup_local_APIC( )
function, which takes care of initializing the local APICs. In
particular, the task priority register (TPR) of each chip is
initialized to a fixed value, meaning that the CPU is willing to
handle every kind of IRQ signal, regardless of its priority. The
Linux kernel never modifies this value after its
initialization.
All task priority registers contain the same value, thus all CPUs always have the same priority. To break a tie, the multi-APIC system uses the values in the arbitration priority registers of local APICs, as explained earlier. Because such values are automatically changed after every interrupt, the IRQ signals are, in most cases, fairly distributed among all CPUs.[*]
In short, when a hardware device raises an IRQ signal, the multi-APIC system selects one of the CPUs and delivers the signal to the corresponding local APIC, which in turn interrupts its CPU. No other CPUs are notified of the event.
All this is magically done by the hardware, so it should be of no concern for the kernel after multi-APIC system initialization. Unfortunately, in some cases the hardware fails to distribute the interrupts among the microprocessors in a fair way (for instance, some Pentium 4-based SMP motherboards have this problem). Therefore, Linux 2.6 makes use of a special kernel thread called kirqd to correct, if necessary, the automatic assignment of IRQs to CPUs.
The kernel thread exploits a nice feature of multi-APIC
systems, called the IRQ affinity of a CPU: by modifying the Interrupt Redirection
Table entries of the I/O APIC, it is possible to route an interrupt
signal to a specific CPU. This can be done by invoking the set_ioapic_affinity_irq( ) function, which
acts on two parameters: the IRQ vector to be rerouted and a 32-bit
mask denoting the CPUs that can receive the IRQ. The IRQ affinity of
a given interrupt also can be changed by the system administrator by
writing a new CPU bitmap mask into the /proc/irq/n/smp_affinity file
(n being the interrupt vector).
The kirqd kernel thread periodically
executes the do_irq_balance( )
function, which keeps track of the number of interrupt occurrences
received by every CPU in the most recent time interval. If the
function discovers that the IRQ load imbalance between the heaviest
loaded CPU and the least loaded CPU is significantly high, then it
either selects an IRQ to be "moved" from a CPU to another, or
rotates all IRQs among all existing CPUs.
As mentioned in the section "Identifying a Process"
in Chapter 3, the thread_info descriptor of each process is
coupled with a Kernel Mode stack in a thread_union data structure composed of
one or two page frames, according to an option selected when the
kernel has been compiled. If the size of the thread_union structure is 8 KB, the Kernel
Mode stack of the current process is used for every type of kernel
control path: exceptions, interrupts, and deferrable functions (see
the later section "Softirqs and Tasklets").
Conversely, if the size of the thread_union structure is 4 KB, the kernel
makes use of three types of Kernel Mode stacks:
The exception stack is used when handling exceptions (including
system calls). This is the stack contained in the per-process
thread_union data structure,
thus the kernel makes use of a different exception stack for
each process in the system.
The hard IRQ stack is used when handling interrupts. There is one hard IRQ stack for each CPU in the system, and each stack is contained in a single page frame.
The soft IRQ stack is used when handling deferrable functions (softirqs or tasklets; see the later section "Softirqs and Tasklets"). There is one soft IRQ stack for each CPU in the system, and each stack is contained in a single page frame.
All hard IRQ stacks are contained in the hardirq_stack array, while all soft IRQ
stacks are contained in the softirq_stack array. Each array element is
a union of type irq_ctx that spans
a single page. At the bottom of this page is stored a thread_info structure, while the spare
memory locations are used for the stack; remember that each stack
grows towards lower addresses. Thus, hard IRQ stacks and soft IRQ
stacks are very similar to the exception stacks described in the
section "Identifying a
Process" in Chapter
3; the only difference is that the thread_info structure coupled with each
stack is associated with a CPU rather than a process.
The hardirq_ctx and
softirq_ctx arrays allow the
kernel to quickly determine the hard IRQ stack and soft IRQ stack of
a given CPU, respectively: they contain pointers to the
corresponding irq_ctx
elements.
When a CPU receives an interrupt, it starts executing the code at the address found in the corresponding gate of the IDT (see the earlier section "Hardware Handling of Interrupts and Exceptions").
As with other context switches, the need to save registers leaves the kernel developer with a somewhat messy coding job, because the registers have to be saved and restored using assembly language code. However, within those operations, the processor is expected to call and return from a C function. In this section, we describe the assembly language task of handling registers; in the next, we show some of the acrobatics required in the C function that is subsequently invoked.
Saving registers is the first task of the interrupt handler.
As already mentioned, the address of the interrupt handler for IRQ
n is initially stored in the interrupt[n] entry and then copied into
the interrupt gate included in the proper IDT entry.
The interrupt array is
built through a few assembly language instructions in the arch/i386/kernel/entry.S file. The array includes NR_IRQS elements, where the NR_IRQS macro yields either the number 224
if the kernel supports a recent I/O APIC chip,[*] or the number 16 if the kernel uses the older 8259A
PIC chips. The element at index n in the array
stores the address of the following two assembly language
instructions:
pushl $n-256
jmp common_interrupt
The result is to save on the stack the IRQ number associated
with the interrupt minus 256. The kernel represents all IRQs through
negative numbers, because it reserves positive interrupt numbers to
identify system calls (see Chapter 10). The same code for
all interrupt handlers can then be executed while referring to this
number. The common code starts at label common_interrupt and consists of the
following assembly language macros and instructions:
common_interrupt:
    SAVE_ALL
    movl %esp,%eax
    call do_IRQ
    jmp ret_from_intr
The SAVE_ALL macro expands
to the following fragment:
cld
push %es
push %ds
pushl %eax
pushl %ebp
pushl %edi
pushl %esi
pushl %edx
pushl %ecx
pushl %ebx
movl $ _ _USER_DS,%edx
movl %edx,%ds
movl %edx,%es
SAVE_ALL saves all the CPU
registers that may be used by the interrupt handler on the stack,
except for eflags , cs, eip, ss, and esp, which are already saved automatically
by the control unit (see the earlier section "Hardware Handling of
Interrupts and Exceptions"). The macro then loads the
selector of the user data segment into ds and es.
After saving the registers, the address of the current top
stack location is saved in the eax register; then, the interrupt handler
invokes the do_IRQ( ) function.
When the ret instruction of
do_IRQ( ) is executed (when that
function terminates), control is transferred to ret_from_intr( ) (see the later section
"Returning from Interrupts
and Exceptions").
The do_IRQ( )
function is invoked to execute all interrupt service
routines associated with an interrupt. It is declared as
follows:
_ _attribute_ _((regparm(3))) unsigned int do_IRQ(struct pt_regs *regs)
The regparm keyword
instructs the function to go to the eax register to find the value of the
regs argument; as seen above,
eax points to the stack location
containing the last register value pushed on by SAVE_ALL.
The do_IRQ( )
function executes the following actions:
1. Executes the irq_enter( ) macro, which increases a counter
representing the number of nested interrupt handlers. The counter is
stored in the preempt_count field of the thread_info structure of
the current process (see Table 4-10 later in this chapter).
2. If the size of the thread_union structure is 4 KB, it switches to
the hard IRQ stack. In particular, the function performs the
following substeps:
   a. Executes the current_thread_info( ) function to get the
   address of the thread_info descriptor associated with the Kernel
   Mode stack addressed by the esp register (see the section
   "Identifying a Process" in Chapter 3).
   b. Compares the address of the thread_info descriptor obtained in
   the previous step with the address stored in
   hardirq_ctx[smp_processor_id( )], that is, the address of the
   thread_info descriptor associated with the local CPU. If the two
   addresses are equal, the kernel is already using the hard IRQ
   stack, thus it jumps to step 3. This happens when an IRQ is
   raised while the kernel is still handling another interrupt.
   c. Here the Kernel Mode stack has to be switched. Stores the
   pointer to the current process descriptor in the task field of
   the thread_info descriptor in the irq_ctx union of the local CPU.
   This is done so that the current macro works as expected while
   the kernel is using the hard IRQ stack (see the section
   "Identifying a Process" in Chapter 3).
   d. Stores the current value of the esp stack pointer register in
   the previous_esp field of the thread_info descriptor in the
   irq_ctx union of the local CPU (this field is used only when
   preparing the function call trace for a kernel oops).
   e. Loads in the esp stack register the top location of the hard
   IRQ stack of the local CPU (the value in
   hardirq_ctx[smp_processor_id( )] plus 4096); the previous value
   of the esp register is saved in the ebx register.
3. Invokes the _ _do_IRQ( ) function passing to it the pointer regs
and the IRQ number obtained from the regs->orig_eax field (see the
following section).
4. If the hard IRQ stack has been effectively switched in step 2e
above, the function copies the original stack pointer from the ebx
register into the esp register, thus switching back to the exception
stack or soft IRQ stack that were in use before.
5. Executes the irq_exit( ) macro, which decreases the interrupt
counter and checks whether deferrable kernel functions are waiting
to be executed (see the section "Softirqs and Tasklets" later in
this chapter).
6. Terminates: the control is transferred to the ret_from_intr( )
function (see the later section "Returning from Interrupts and
Exceptions").
The _ _do_IRQ( )
function receives as its parameters an IRQ number (through the
eax register) and a pointer to
the pt_regs structure where the
User Mode register values have been saved (through the edx register).
The function is equivalent to the following code fragment:
spin_lock(&(irq_desc[irq].lock));
irq_desc[irq].handler->ack(irq);
irq_desc[irq].status &= ~(IRQ_REPLAY | IRQ_WAITING);
irq_desc[irq].status |= IRQ_PENDING;
if (!(irq_desc[irq].status & (IRQ_DISABLED | IRQ_INPROGRESS))
        && irq_desc[irq].action) {
    irq_desc[irq].status |= IRQ_INPROGRESS;
    do {
        irq_desc[irq].status &= ~IRQ_PENDING;
        spin_unlock(&(irq_desc[irq].lock));
        handle_IRQ_event(irq, regs, irq_desc[irq].action);
        spin_lock(&(irq_desc[irq].lock));
    } while (irq_desc[irq].status & IRQ_PENDING);
    irq_desc[irq].status &= ~IRQ_INPROGRESS;
}
irq_desc[irq].handler->end(irq);
spin_unlock(&(irq_desc[irq].lock));
Before accessing the main IRQ descriptor, the kernel acquires the corresponding spin lock. We'll see in Chapter 5 that the spin lock protects against concurrent accesses by different CPUs. This spin lock is necessary in a multiprocessor system, because other interrupts of the same kind may be raised, and other CPUs might take care of the new interrupt occurrences. Without the spin lock, the main IRQ descriptor would be accessed concurrently by several CPUs. As we'll see, this situation must be absolutely avoided.
After acquiring the spin lock, the function invokes the
ack method of the main IRQ
descriptor. When using the old 8259A PIC, the corresponding mask_and_ack_8259A( ) function
acknowledges the interrupt on the PIC and also disables the IRQ
line. Masking the IRQ line ensures that the CPU does not accept
further occurrences of this type of interrupt until the handler
terminates. Remember that the _ _do_IRQ(
) function runs with local interrupts disabled; in fact,
the CPU control unit automatically clears the IF flag of the eflags register because the interrupt handler is invoked
through an IDT's interrupt gate. However, we'll see shortly that the
kernel might re-enable local interrupts before executing the
interrupt service routines of this interrupt.
When using the I/O APIC, however, things are much more
complicated. Depending on the type of interrupt, acknowledging the
interrupt could either be done by the ack method or delayed until the interrupt
handler terminates (that is, acknowledgement could be done by the
end method). In either case, we
can take for granted that the local APIC doesn't accept further
interrupts of this type until the handler terminates, although
further occurrences of this type of interrupt may be accepted by
other CPUs.
The _ _do_IRQ( ) function
then initializes a few flags of the main IRQ descriptor. It sets the
IRQ_PENDING flag because the
interrupt has been acknowledged (well, sort of), but not yet really
serviced; it also clears the IRQ_WAITING and IRQ_REPLAY flags (but we don't have to
care about them now).
Now _ _do_IRQ( ) checks
whether it must really handle the interrupt. There are three cases
in which nothing has to be done. These are discussed in the
following list.
IRQ_DISABLED is set
A CPU might execute the _
_do_IRQ( ) function even if the corresponding IRQ
line is disabled; you'll find an explanation for this
nonintuitive case in the later section "Reviving a lost
interrupt." Moreover, buggy motherboards may generate
spurious interrupts even when the IRQ line is disabled in the
PIC.
IRQ_INPROGRESS is set
In a multiprocessor system, another CPU might be
handling a previous occurrence of the same interrupt. Why not
defer the handling of this occurrence to
that CPU? This is exactly what is done by
Linux. This leads to a simpler kernel architecture because
device drivers' interrupt service routines need not be
reentrant (their execution is serialized). Moreover, the freed
CPU can quickly return to what it was doing, without dirtying
its hardware cache; this is beneficial to system performance.
The IRQ_INPROGRESS flag is
set whenever a CPU is committed to execute the interrupt
service routines of the interrupt; therefore, the _ _do_IRQ( ) function checks it
before starting the real work.
irq_desc[irq].action is NULL
This case occurs when there is no interrupt service routine associated with the interrupt. Normally, this happens only when the kernel is probing a hardware device.
Let's suppose that none of the three cases holds, so the
interrupt has to be serviced. The _ _do_IRQ( ) function sets the IRQ_INPROGRESS flag and starts a loop. In
each iteration, the function clears the IRQ_PENDING flag, releases the interrupt
spin lock, and executes the interrupt service routines by invoking
handle_IRQ_event( ) (described
later in the chapter). When the latter function terminates, _ _do_IRQ( ) acquires the spin lock again
and checks the value of the IRQ_PENDING flag. If it is clear, no
further occurrence of the interrupt has been delivered to another
CPU, so the loop ends. Conversely, if IRQ_PENDING is set, another CPU has
executed the do_IRQ( ) function
for this type of interrupt while this CPU was executing handle_IRQ_event( ). Therefore, do_IRQ( ) performs another iteration of
the loop, servicing the new occurrence of the interrupt.[*]
Our _ _do_IRQ( ) function
is now going to terminate, either because it has already executed
the interrupt service routines or because it had nothing to do. The
function invokes the end method
of the main IRQ descriptor. When using the old 8259A PIC, the
corresponding end_8259A_irq( )
function reenables the IRQ line (unless the interrupt occurrence was
spurious). When using the I/O APIC, the end method acknowledges the interrupt (if
not already done by the ack
method).
Finally, _ _do_IRQ( )
releases the spin lock: the hard work is finished!
The _ _do_IRQ( ) function
is small and simple, yet it works properly in most cases. Indeed,
the IRQ_PENDING, IRQ_INPROGRESS, and IRQ_DISABLED flags ensure that interrupts
are correctly handled even when the hardware is misbehaving.
However, things may not work so smoothly in a multiprocessor
system.
Suppose that a CPU has an IRQ line enabled. A hardware device
raises the IRQ line, and the multi-APIC system selects our CPU for
handling the interrupt. Before the CPU acknowledges the interrupt,
the IRQ line is masked out by another CPU; as a consequence, the
IRQ_DISABLED flag is set. Right
afterwards, our CPU starts handling the pending interrupt;
therefore, the do_IRQ( ) function
acknowledges the interrupt and then returns without executing the
interrupt service routines because it finds the IRQ_DISABLED flag set. Therefore, even
though the interrupt occurred before the IRQ line was disabled, it
gets lost.
To cope with this scenario, the enable_irq( ) function, which is used by
the kernel to enable an IRQ line, checks first whether an interrupt
has been lost. If so, the function forces the hardware to generate a
new occurrence of the lost interrupt:
spin_lock_irqsave(&(irq_desc[irq].lock), flags);
if (--irq_desc[irq].depth == 0) {
    irq_desc[irq].status &= ~IRQ_DISABLED;
    if ((irq_desc[irq].status & (IRQ_PENDING | IRQ_REPLAY))
            == IRQ_PENDING) {
        irq_desc[irq].status |= IRQ_REPLAY;
        hw_resend_irq(irq_desc[irq].handler, irq);
    }
    irq_desc[irq].handler->enable(irq);
}
spin_unlock_irqrestore(&(irq_desc[irq].lock), flags);
The function detects that an interrupt was lost by checking
the value of the IRQ_PENDING
flag. The flag is always cleared when leaving the interrupt handler;
therefore, if the IRQ line is disabled and the flag is set, then an
interrupt occurrence has been acknowledged but not yet serviced. In
this case the hw_resend_irq( )
function raises a new interrupt. This is obtained by forcing the
local APIC to generate a self-interrupt (see the later section
"Interprocessor
Interrupt Handling"). The role of the IRQ_REPLAY flag is to ensure that exactly
one self-interrupt is generated. Remember that the _ _do_IRQ( ) function clears that flag when
it starts handling the interrupt.
As mentioned previously, an interrupt service routine
handles an interrupt by executing an operation specific to one type
of device. When an interrupt handler must execute the ISRs, it
invokes the handle_IRQ_event( )
function. This function essentially performs the following
steps:
Enables the local interrupts with the sti assembly language instruction if the SA_INTERRUPT flag is clear.
Executes each interrupt service routine of the interrupt through the following code:
retval = 0;
do {
retval |= action->handler(irq, action->dev_id, regs);
action = action->next;
} while (action);

At the start of the loop, action points to the start of a list
of irqaction data structures
that indicate the actions to be taken upon receiving the
interrupt (see Figure
4-5 earlier in this chapter).
Disables local interrupts with the cli assembly language instruction.
Terminates by returning the value of the retval local variable, that is, 0 if
no interrupt service routine has recognized the interrupt, 1
otherwise (see next).
All interrupt service routines act on the same parameters
(once again they are passed through the eax, edx, and ecx registers, respectively):
irq
The IRQ number
dev_id
The device identifier
regs
A pointer to a pt_regs structure on the Kernel Mode
(exception) stack containing the registers saved right after
the interrupt occurred. The pt_regs structure consists of 15
fields:
The first nine fields are the register values pushed
by SAVE_ALL
The tenth field, referenced through a field called
orig_eax, encodes the
IRQ number
The remaining fields correspond to the register values pushed on automatically by the control unit
The first parameter allows a single ISR to handle several IRQ lines, the second one allows a single ISR to take care of several devices of the same type, and the last one allows the ISR to access the execution context of the interrupted kernel control path. In practice, most ISRs do not use these parameters.
Every interrupt service routine returns the value 1 if the interrupt has been effectively handled, that is, if the signal was raised by the hardware device handled by the interrupt service routine (and not by another device sharing the same IRQ); it returns the value 0 otherwise. This return code allows the kernel to update the counter of unexpected interrupts mentioned in the section "IRQ data structures" earlier in this chapter.
The SA_INTERRUPT flag of
the main IRQ descriptor determines whether interrupts must be
enabled or disabled when the do_IRQ(
) function invokes an ISR. An ISR that has been invoked
with the interrupts in one state is allowed to put them in the
opposite state. In a uniprocessor system, this can be achieved by
means of the cli (disable
interrupts) and sti (enable
interrupts) assembly language instructions.
The structure of an ISR depends on the characteristics of the device handled. We'll give a couple of examples of ISRs in Chapter 6 and Chapter 13.
As noted in section "Interrupt vectors," a few vectors are reserved for specific devices, while the remaining ones are dynamically handled. There is, therefore, a way in which the same IRQ line can be used by several hardware devices even if they do not allow IRQ sharing. The trick is to serialize the activation of the hardware devices so that just one owns the IRQ line at a time.
Before activating a device that is going to use an IRQ line,
the corresponding driver invokes request_irq( ). This function creates a
new irqaction descriptor and
initializes it with the parameter values; it then invokes the
setup_irq( ) function to insert
the descriptor in the proper IRQ list. The device driver aborts the
operation if setup_irq( ) returns
an error code, which usually means that the IRQ line is already in
use by another device that does not allow interrupt sharing. When
the device operation is concluded, the driver invokes the free_irq( ) function to remove the
descriptor from the IRQ list and release the memory area.
Let's see how this scheme works on a simple example. Assume a program wants to address the /dev/fd0 device file, which corresponds to the first floppy disk on the system.[*] The program can do this either by directly accessing /dev/fd0 or by mounting a filesystem on it. Floppy disk controllers are usually assigned IRQ 6; given this, a floppy driver may issue the following request:
request_irq(6, floppy_interrupt,
    SA_INTERRUPT|SA_SAMPLE_RANDOM, "floppy", NULL);
As can be observed, the floppy_interrupt( ) interrupt service
routine must execute with the interrupts disabled (SA_INTERRUPT flag set) and no sharing of
the IRQ (SA_SHIRQ flag missing).
The SA_SAMPLE_RANDOM flag set
means that accesses to the floppy disk are a good source of random
events to be used for the kernel random number generator. When the
operation on the floppy disk is concluded (either the I/O operation
on /dev/fd0 terminates or the
filesystem is unmounted), the driver releases IRQ 6:
free_irq(6, NULL);
To insert an irqaction
descriptor in the proper list, the kernel invokes the setup_irq( ) function, passing to it the
parameters irq _nr, the IRQ
number, and new (the address of a
previously allocated irqaction
descriptor). This function:
Checks whether another device is already using the
irq _nr IRQ and, if so,
whether the SA_SHIRQ flags in
the irqaction descriptors of
both devices specify that the IRQ line can be shared. Returns an
error code if the IRQ line cannot be used.
Adds *new (the new
irqaction descriptor pointed
to by new) at the end of the
list to which irq _desc[irq
_nr]->action points.
If no other device is sharing the same IRQ, the function
clears the IRQ _DISABLED,
IRQ_AUTODETECT, IRQ_WAITING, and IRQ _INPROGRESS flags in the flags field of *new and invokes the startup method of the irq_desc[irq_nr]->handler PIC
object to make sure that IRQ signals are enabled.
Here is an example of how setup_irq(
) is used, drawn from system initialization. The kernel
initializes the irq0 descriptor
of the interval timer device by executing the following instructions
in the time_init( ) function (see
Chapter 6):
struct irqaction irq0 =
    {timer_interrupt, SA_INTERRUPT, 0, "timer", NULL, NULL};
setup_irq(0, &irq0);
First, the irq0 variable of
type irqaction is initialized:
the handler field is set to the
address of the timer_interrupt( )
function, the flags field is set
to SA_INTERRUPT, the name field is set to "timer", and the fifth field is set to
NULL to show that no dev_id value is used. Next, the kernel
invokes setup_irq( ) to insert
irq0 in the list of irqaction descriptors associated with IRQ
0.
Interprocessor interrupts allow a CPU to send interrupt signals to any other CPU in the system. As explained in the section "The Advanced Programmable Interrupt Controller (APIC)" earlier in this chapter, an interprocessor interrupt (IPI) is delivered not through an IRQ line, but directly as a message on the bus that connects the local APIC of all CPUs (either a dedicated bus in older motherboards, or the system bus in the Pentium 4-based motherboards).
On multiprocessor systems, Linux makes use of three kinds of interprocessor interrupts (see also Table 4-2):
CALL_FUNCTION_VECTOR
(vector 0xfb)
Sent to all CPUs but the sender, forcing those CPUs to run
a function passed by the sender. The corresponding interrupt
handler is named call_function_interrupt( ). The
function (whose address is passed in the call_data global variable) may, for
instance, force all other CPUs to stop, or may force them to set
the contents of the Memory Type Range Registers
(MTRRs).[*] Usually this interrupt is sent to all CPUs except
the CPU executing the calling function by means of the smp_call_function( ) facility
function.
RESCHEDULE_VECTOR
(vector 0xfc)
When a CPU receives this type of interrupt, the
corresponding handler — named reschedule_interrupt( ) — limits
itself to acknowledging the interrupt. Rescheduling is done
automatically when returning from the interrupt (see the section
"Returning from
Interrupts and Exceptions" later in this chapter).
INVALIDATE_TLB_VECTOR
(vector 0xfd)
Sent to all CPUs but the sender, forcing them to
invalidate their Translation Lookaside Buffers. The
corresponding handler, named invalidate_interrupt( ), flushes some
TLB entries of the processor as described in the section "Handling the Hardware
Cache and the TLB" in Chapter 2.
The assembly language code of the interprocessor interrupt
handlers is generated by the BUILD_INTERRUPT macro: it saves the
registers, pushes the vector number minus 256 on the stack, and then
invokes a high-level C function having the same name as the low-level
handler preceded by smp_. For
instance, the high-level handler of the CALL_FUNCTION_VECTOR interprocessor
interrupt that is invoked by the low-level call_function_interrupt( ) handler is named
smp_call_function_interrupt( ).
Each high-level handler acknowledges the interprocessor interrupt on
the local APIC and then performs the specific action triggered by the
interrupt.
Thanks to the following group of functions, issuing interprocessor interrupts (IPIs) becomes an easy task:
send_IPI_all( )
Sends an IPI to all CPUs (including the sender)
send_IPI_allbutself( )
Sends an IPI to all CPUs except the sender
send_IPI_self( )
Sends an IPI to the sender CPU
send_IPI_mask( )
Sends an IPI to a group of CPUs specified by a bit mask
[*] In contrast to disable_irq_nosync( ), disable_irq(n) waits until all
interrupt handlers for IRQ n that are
running on other CPUs have completed before returning.
[*] There is an exception, though. Linux usually sets up the local APICs in such a way as to honor the focus processor, when it exists. A focus processor catches all IRQs of the same type as long as it has received an IRQ of that type and has not finished executing the interrupt handler. However, Intel has dropped support for focus processors in the Pentium 4 model.
[*] 256 vectors is an architectural limit for the 80×86 architecture. 32 of them are used or reserved for the CPU, so the usable vector space consists of 224 vectors.
[*] Because IRQ_PENDING is
a flag and not a counter, only the second occurrence of the
interrupt can be recognized. Further occurrences in each
iteration of the do_IRQ( )'s
loop are simply lost.
[*] Floppy disks are "old" devices that do not usually allow IRQ sharing.
[*] Starting with the Pentium Pro model, Intel microprocessors include these additional registers to easily customize cache operations. For instance, Linux may use these registers to disable the hardware cache for the addresses mapping the frame buffer of a PCI/AGP graphic card while maintaining the "write combining" mode of operation: the paging unit combines write transfers into larger chunks before copying them into the frame buffer.
We mentioned earlier in the section "Interrupt Handling" that several tasks among those executed by the kernel are not critical: they can be deferred for a long period of time, if necessary. Remember that the interrupt service routines of an interrupt handler are serialized, and often there should be no occurrence of an interrupt until the corresponding interrupt handler has terminated. Conversely, the deferrable tasks can execute with all interrupts enabled. Taking them out of the interrupt handler helps keep kernel response time small. This is a very important property for many time-critical applications that expect their interrupt requests to be serviced in a few milliseconds.
Linux 2.6 answers such a challenge by using two kinds of non-urgent interruptible kernel functions: the so-called deferrable functions [*] (softirqs and tasklets ), and those executed by means of some work queues (we will describe them in the section "Work Queues" later in this chapter).
Softirqs and tasklets are strictly correlated, because tasklets are implemented on top of softirqs. As a matter of fact, the term "softirq," which appears in the kernel source code, often denotes both kinds of deferrable functions. Another widely used term is interrupt context : it specifies that the kernel is currently executing either an interrupt handler or a deferrable function.
Softirqs are statically allocated (i.e., defined at compile time), while tasklets can also be allocated and initialized at runtime (for instance, when loading a kernel module). Softirqs can run concurrently on several CPUs, even if they are of the same type. Thus, softirqs are reentrant functions and must explicitly protect their data structures with spin locks. Tasklets do not have to worry about this, because their execution is controlled more strictly by the kernel. Tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time. However, tasklets of different types can be executed concurrently on several CPUs. Serializing the tasklet simplifies the life of device driver developers, because the tasklet function need not be reentrant.
Generally speaking, four kinds of operations can be performed on deferrable functions:
Defines a new deferrable function; this operation is usually done when the kernel initializes itself or a module is loaded.
Marks a deferrable function as "pending" — to be run the next time the kernel schedules a round of executions of deferrable functions. Activation can be done at any time (even while handling interrupts).
Selectively disables a deferrable function so that it will not be executed by the kernel even if activated. We'll see in the section "Disabling and Enabling Deferrable Functions" in Chapter 5 that disabling deferrable functions is sometimes essential.
Executes a pending deferrable function together with all other pending deferrable functions of the same type; execution is performed at well-specified times, explained later in the section "Softirqs."
Activation and execution are bound together: a deferrable function that has been activated by a given CPU must be executed on the same CPU. There is no self-evident reason suggesting that this rule is beneficial for system performance. Binding the deferrable function to the activating CPU could in theory make better use of the CPU hardware cache. After all, it is conceivable that the activating kernel thread accesses some data structures that will also be used by the deferrable function. However, the relevant lines could easily be no longer in the cache when the deferrable function is run because its execution can be delayed a long time. Moreover, binding a function to a CPU is always a potentially "dangerous" operation, because one CPU might end up very busy while the others are mostly idle.
Linux 2.6 uses a limited number of softirqs . For most purposes, tasklets are good enough and are much easier to write because they do not need to be reentrant.
As a matter of fact, only the six kinds of softirqs listed in Table 4-9 are currently defined.
Table 4-9. Softirqs used in Linux 2.6
| Softirq | Index (priority) | Description |
|---|---|---|
| HI_SOFTIRQ | 0 | Handles high priority tasklets |
| TIMER_SOFTIRQ | 1 | Tasklets related to timer interrupts |
| NET_TX_SOFTIRQ | 2 | Transmits packets to network cards |
| NET_RX_SOFTIRQ | 3 | Receives packets from network cards |
| SCSI_SOFTIRQ | 4 | Post-interrupt processing of SCSI commands |
| TASKLET_SOFTIRQ | 5 | Handles regular tasklets |
The index of a softirq determines its priority: a lower index means higher priority because softirq functions will be executed starting from index 0.
The main data structure used to represent softirqs is
the softirq_vec array, which
includes 32 elements of type softirq_action. The priority of a softirq
is the index of the corresponding softirq_action element inside the array.
As shown in Table
4-9, only the first six entries of the array are effectively
used. The softirq_action data
structure consists of two fields: an action pointer to the softirq function and
a data pointer to a generic data
structure that may be needed by the softirq function.
Another critical field used to keep track both of kernel
preemption and of nesting of kernel control paths is the 32-bit preempt_count field stored in the thread_info field of each process
descriptor (see the section "Identifying a Process"
in Chapter 3). This field
encodes three distinct counters plus a flag, as shown in Table 4-10.
Table 4-10. Subfields of the preempt_count field
| Bits | Description |
|---|---|
| 0–7 | Preemption counter (max value = 255) |
| 8–15 | Softirq counter (max value = 255) |
| 16–27 | Hardirq counter (max value = 4096) |
| 28 | PREEMPT_ACTIVE flag |
The first counter keeps track of how many times kernel
preemption has been explicitly disabled on the local CPU; the value
zero means that kernel preemption has not been explicitly disabled
at all. The second counter specifies how many levels deep the
disabling of deferrable functions is (level 0 means that deferrable
functions are enabled). The third counter specifies the number of
nested interrupt handlers on the local CPU (the value is increased
by irq_enter( ) and decreased by
irq_exit( ); see the section
"I/O Interrupt
Handling" earlier in this chapter).
There is a good reason for the name of the preempt_count field: kernel preemptability
has to be disabled either when it has been explicitly disabled by
the kernel code (preemption counter not zero) or when the kernel is
running in interrupt context. Thus, to determine whether the current
process can be preempted, the kernel quickly checks for a zero value
in the preempt_count field.
Kernel preemption will be discussed in depth in the section "Kernel Preemption" in
Chapter 5.
The in_interrupt( ) macro
checks the hardirq and softirq counters in the current_thread_info( )->preempt_count
field. If either one of these two counters is positive, the macro
yields a nonzero value, otherwise it yields the value zero. If the
kernel does not make use of multiple Kernel Mode stacks, the macro
always looks at the preempt_count
field of the thread_info
descriptor of the current process. If, however, the kernel makes use
of multiple Kernel Mode stacks, the macro might look at the preempt_count field in the thread_info descriptor contained in a
irq_ctx union associated with the
local CPU. In this case, the macro returns a nonzero value because
the field is always set to a positive value.
The last crucial data structure for implementing the softirqs
is a per-CPU 32-bit mask describing the pending softirqs; it is
stored in the _ _softirq_pending
field of the irq_cpustat_t data
structure (recall that there is one such structure per each CPU in
the system; see Table
4-8). To get and set the value of the bit mask, the kernel
makes use of the local_softirq_pending(
) macro that selects the softirq bit mask of the local
CPU.
The open_softirq( )
function takes care of softirq initialization. It uses three
parameters: the softirq index, a pointer to the softirq function to
be executed, and a second pointer to a data structure that may be
required by the softirq function. open_softirq( ) limits itself to
initializing the proper entry of the softirq_vec array.
Softirqs are activated by means of the raise_softirq( ) function. This function,
which receives as its parameter the softirq index nr, performs the following actions:
Executes the local_irq_save macro to save the state
of the IF flag of the
eflags register and to disable interrupts on the local
CPU.
Marks the softirq as pending by setting the bit
corresponding to the index nr
in the softirq bit mask of the local CPU.
If in_interrupt()
yields the value 1, it jumps to step 5. This situation indicates
either that raise_softirq( )
has been invoked in interrupt context, or that the softirqs are
currently disabled.
Otherwise, invokes wakeup_softirqd() to wake up, if
necessary, the ksoftirqd kernel thread of the local CPU (see
later).
Executes the local_irq_restore macro to restore the
state of the IF flag saved in
step 1.
Checks for active (pending) softirqs should be performed periodically, but without inducing too much overhead. They are performed in a few points of the kernel code. Here is a list of the most significant points (be warned that the number and position of the softirq checkpoints change both with the kernel version and with the supported hardware architecture):
When the kernel invokes the local_bh_enable( ) function[*] to enable softirqs on the local CPU
When the do_IRQ( )
function finishes handling an I/O interrupt and invokes the
irq_exit( ) macro
If the system uses an I/O APIC, when the smp_apic_timer_interrupt( ) function
finishes handling a local timer interrupt (see the section
"Timekeeping
Architecture in Multiprocessor Systems" in Chapter 6)
In multiprocessor systems, when a CPU finishes handling a
function triggered by a CALL_FUNCTION_VECTOR interprocessor
interrupt
When one of the special ksoftirqd/n kernel threads is awakened (see later)
If pending softirqs are detected at one such
checkpoint (local_softirq_pending() is not zero), the
kernel invokes do_softirq( ) to
take care of them. This function performs the following
actions:
If in_interrupt( )
yields the value one, this function returns. This situation
indicates either that do_softirq(
) has been invoked in interrupt context or that the
softirqs are currently disabled.
Executes local_irq_save
to save the state of the IF
flag and to disable the interrupts on the local CPU.
If the size of the thread_union structure is 4 KB, it
switches to the soft IRQ stack, if necessary. This step is very
similar to step 2 of do_IRQ(
) in the earlier section "I/O Interrupt
Handling;" of course, the softirq_ctx array is used instead of
hardirq_ctx.
Invokes the _ _do_softirq(
) function (see the following section).
If the soft IRQ stack has been effectively switched in
step 3 above, it restores the original stack pointer into the
esp register, thus switching
back to the exception stack that was in use before.
Executes local_irq_restore to restore the state
of the IF flag (local
interrupts enabled or disabled) saved in step 2 and
returns.
The _ _do_softirq(
) function reads the softirq bit mask of the local CPU and
executes the deferrable functions corresponding to every set bit.
While executing a softirq function, new pending softirqs might pop
up; in order to ensure a low latency time for the deferrable
functions, _ _do_softirq( ) keeps
running until all pending softirqs have been executed. This
mechanism, however, could force _ _do_softirq( ) to run for long periods of
time, thus considerably delaying User Mode processes. For that
reason, _ _do_softirq( ) performs
a fixed number of iterations and then returns. The remaining pending
softirqs, if any, will be handled in due time by the
ksoftirqd kernel thread described in the next
section. Here is a short description of the actions performed by the
function:
Initializes the iteration counter to 10.
Copies the softirq bit mask of the local CPU (selected by
local_softirq_pending( )) in
the pending local
variable.
Invokes local_bh_disable(
) to increase the softirq counter. It is somewhat
counterintuitive that deferrable functions should be disabled
before starting to execute them, but it really makes a lot of
sense. Because the deferrable functions mostly run with
interrupts enabled, an interrupt can be raised in the middle of
the _ _do_softirq( )
function. When do_IRQ( )
executes the irq_exit( )
macro, another instance of the _
_do_softirq( ) function could be started. This has to
be avoided, because deferrable functions must execute serially
on the CPU. Thus, the first instance of _ _do_softirq( ) disables deferrable
functions, so that every new instance of the function will exit
at step 1 of do_softirq(
).
清除本地 CPU 的软中断位图,以便可以激活新的软中断(位掩码的值已在步骤 2 中保存在 pending 本地变量中)。
Clears the softirq bitmap of the local CPU, so that new
softirqs can be activated (the value of the bit mask has already
been saved in the pending
local variable in step 2).
执行local_irq_enable(
)以启用本地中断。
Executes local_irq_enable(
) to enable local interrupts.
对于 pending 局部变量中设置的每个位,执行相应的软中断函数;回想一下,索引为 n 的软中断的函数地址存储在 softirq_vec[n]->action 中。
For each bit set in the pending local variable, it executes
the corresponding softirq function; recall that the function
address for the softirq with index n is stored in softirq_vec[n]->action.
执行local_irq_disable()以禁用本地中断。
Executes local_irq_disable() to disable local
interrupts.
将本地 CPU 的软中断位掩码复制到 pending 本地变量中,并将迭代计数器再减一。
Copies the softirq bit mask of the local CPU into the
pending local variable and
decreases the iteration counter one more time.
如果pending不为零(自上一次迭代开始以来至少有一个软中断已被激活)并且迭代计数器仍然为正,则跳回到步骤 4。
If pending is not
zero—at least one softirq has been activated since the start of
the last iteration—and the iteration counter is still positive,
it jumps back to step 4.
如果有更多挂起的软中断,它会调用wakeup_softirqd( )唤醒负责本地 CPU 软中断的内核线程(请参阅下一节)。
If there are more pending softirqs, it invokes wakeup_softirqd( ) to wake up the
kernel thread that takes care of the softirqs for the local CPU
(see next section).
从软中断计数器中减去 1,从而重新启用可延迟功能。
Subtracts 1 from the softirq counter, thus reenabling the deferrable functions.
在最近的内核版本中,每个 CPU 都有自己的 ksoftirqd/n 内核线程(其中 n 是 CPU 的逻辑编号)。每个 ksoftirqd/n 内核线程都运行 ksoftirqd( ) 函数,该函数本质上执行以下循环:
In recent kernel versions, each CPU has its own
ksoftirqd/n kernel thread (where
n is the logical number of the CPU). Each
ksoftirqd/n kernel thread runs the ksoftirqd( ) function, which essentially
executes the following loop:
for(;;) {
set_current_state(TASK_INTERRUPTIBLE );
schedule( );
/* now in TASK_RUNNING state */
while (local_softirq_pending( )) {
preempt_disable();
do_softirq( );
preempt_enable();
cond_resched( );
}
}

当被唤醒时,内核线程检查 local_softirq_pending( ) 软中断位掩码,并在必要时调用 do_softirq( )。如果没有待处理的软中断,该函数将当前进程置于 TASK_INTERRUPTIBLE 状态,然后调用 cond_resched( ) 函数,以便在当前进程需要时(即 current 的 thread_info 中设置了 TIF_NEED_RESCHED 标志)执行进程切换。
When awakened, the kernel thread checks the local_softirq_pending() softirq bit mask
and invokes, if necessary, do_softirq(
). If there are no softirqs pending, the function puts the
current process in the TASK_INTERRUPTIBLE state and then invokes
the cond_resched() function to
perform a process switch if required by the current process (flag
TIF_NEED_RESCHED of the current
thread_info set).
ksoftirqd/n 内核线程代表了一个关键权衡问题的解决方案。
The ksoftirqd/n kernel threads represent a solution for a critical trade-off problem.
Softirq 函数可能会自行重新激活;事实上,网络软中断和 tasklet 软中断都这样做。此外,外部事件(例如网卡上的数据包洪泛)可能以非常高的频率激活软中断。
Softirq functions may reactivate themselves; in fact, both the networking softirqs and the tasklet softirqs do this. Moreover, external events, such as packet flooding on a network card, may activate softirqs at very high frequency.
连续大量软中断流的潜力会产生一个问题,通过引入内核线程可以解决这个问题。如果没有它们,开发人员基本上面临两种替代策略。
The potential for a continuous high-volume flow of softirqs creates a problem that is solved by introducing kernel threads. Without them, developers are essentially faced with two alternative strategies.
第一个策略是忽略 do_softirq( ) 运行时出现的新软中断。换句话说,do_softirq( ) 函数可以在启动时确定哪些软中断正在挂起,然后执行它们的函数,接着就终止而不再重新检查挂起的软中断。这个解决方案还不够好。假设某个软中断函数在 do_softirq( ) 执行期间被重新激活。在最坏的情况下,即使机器空闲,该软中断也要等到下一个定时器中断才会再次执行。因此,这样的软中断延迟对网络开发人员来说是不可接受的。
The first strategy consists of ignoring new softirqs that
occur while do_softirq( ) is
running. In other words, the do_softirq(
) function could determine what softirqs are pending when
the function is started and then execute their functions. Next, it
would terminate without rechecking the pending softirqs. This
solution is not good enough. Suppose that a softirq function is
reactivated during the execution of do_softirq( ). In the worst case, the
softirq is not executed again until the next timer interrupt, even
if the machine is idle. As a result, softirq latency time is
unacceptable for networking developers.
第二种策略是不断重新检查待处理的软中断。do_softirq( ) 函数可以持续检查挂起的软中断,仅当没有任何软中断挂起时才终止。虽然这个方案可能让网络开发人员满意,但它肯定会让系统的普通用户不安:如果网卡接收到高频数据包流,或者某个软中断函数不断激活自身,do_softirq( ) 函数就永远不会返回,用户模式程序实际上都停止了。
The second strategy consists of continuously rechecking for
pending softirqs. The do_softirq(
) function could keep checking the pending softirqs and
would terminate only when none of them is pending. While this
solution might satisfy networking developers, it can certainly upset
normal users of the system: if a high-frequency flow of packets is
received by a network card or a softirq function keeps activating
itself, the do_softirq( )
function never returns, and the User Mode programs are virtually
stopped.
ksoftirqd/n 内核线程试图解决这个困难的权衡问题。do_softirq( ) 函数确定哪些软中断正在挂起并执行它们的函数。经过几次迭代后,如果软中断流没有停止,该函数就唤醒内核线程并终止(_ _do_softirq( ) 的第 10 步)。内核线程优先级较低,因此用户程序有机会运行;但如果机器空闲,挂起的软中断就会很快执行。
The ksoftirqd/n kernel threads try to
solve this difficult trade-off problem. The do_softirq( ) function determines what
softirqs are pending and executes their functions. After a few
iterations, if the flow of softirqs does not stop, the function
wakes up the kernel thread and terminates (step 10 of _ _do_softirq( )). The kernel thread has low
priority, so user programs have a chance to run; but if the machine
is idle, the pending softirqs are executed quickly.
Tasklet 是在 I/O 驱动程序中实现可延迟函数的首选方法。正如前面所解释的,tasklet 构建在两个名为 HI_SOFTIRQ 和 TASKLET_SOFTIRQ 的软中断之上。多个 tasklet 可以与同一个软中断相关联,每个 tasklet 携带自己的函数。这两个软中断之间没有真正的区别,只是 do_softirq( ) 先执行 HI_SOFTIRQ 的 tasklet,再执行 TASKLET_SOFTIRQ 的 tasklet。
Tasklets are the preferred way to implement deferrable
functions in I/O drivers. As already explained, tasklets are built on top of two softirqs named HI_SOFTIRQ and TASKLET_SOFTIRQ. Several tasklets may be
associated with the same softirq, each tasklet carrying its own
function. There is no real difference between the two softirqs, except
that do_softirq( ) executes
HI_SOFTIRQ's tasklets before
TASKLET_SOFTIRQ's tasklets.
Tasklet 和高优先级 tasklet 分别存储在 tasklet_vec 和 tasklet_hi_vec 数组中。两者都包含 NR_CPUS 个 tasklet_head 类型的元素,每个元素由一个指向 tasklet 描述符列表的指针组成。tasklet 描述符是一个 tasklet_struct 类型的数据结构,其字段如表 4-11 所示。
Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec arrays, respectively. Both of
them include NR_CPUS elements of
type tasklet_head, and each element
consists of a pointer to a list of tasklet
descriptors. The tasklet descriptor is a data structure of
type tasklet_struct, whose fields
are shown in Table
4-11.
表 4-11。Tasklet 描述符的字段
Table 4-11. The fields of the tasklet descriptor
| 字段名称 Field name | 描述 Description |
|---|---|
| next | 指向列表中下一个描述符的指针 Pointer to next descriptor in the list |
| state | tasklet 的状态 Status of the tasklet |
| count | 锁定计数器 Lock counter |
| func | 指向 tasklet 函数的指针 Pointer to the tasklet function |
| data | 可由 tasklet 函数使用的无符号长整型 An unsigned long integer that may be used by the tasklet function |
Tasklet 描述符的 state 字段包括两个标志:
The state field of the
tasklet descriptor includes two flags:
TASKLET_STATE_SCHED
设置后,表明该 tasklet 处于挂起状态(已被调度执行);这也意味着该 tasklet 描述符被插入到 tasklet_vec 和 tasklet_hi_vec 数组的某个列表中。
When set, this indicates that the tasklet is pending (has
been scheduled for execution); it also means that the tasklet
descriptor is inserted in one of the lists of the tasklet_vec and tasklet_hi_vec arrays.
TASKLET_STATE_RUN
设置后,表明该 tasklet 正在执行;在单处理器系统上不使用此标志,因为不需要检查特定的 tasklet 是否正在运行。
When set, this indicates that the tasklet is being executed; on a uniprocessor system this flag is not used because there is no need to check whether a specific tasklet is running.
假设您正在编写一个设备驱动程序并且想要使用一个 tasklet:需要做什么?首先,您应该分配一个新的 tasklet_struct 数据结构,并通过调用 tasklet_init( ) 来初始化它;该函数接收的参数是 tasklet 描述符的地址、tasklet 函数的地址及其可选的整数参数。
Let's suppose you're writing a device driver and you want to use
a tasklet: what has to be done? First of all, you should allocate a
new tasklet_struct data structure
and initialize it by invoking tasklet_init(
); this function receives as its parameters the address of
the tasklet descriptor, the address of your tasklet function, and its
optional integer argument.
可以通过调用 tasklet_disable_nosync( ) 或 tasklet_disable( ) 来选择性地禁用该 tasklet。这两个函数都会增加 tasklet 描述符的 count 字段,但后一个函数直到正在运行的 tasklet 函数实例终止后才返回。要重新启用该 tasklet,请使用 tasklet_enable( )。
The tasklet may be selectively disabled by invoking either
tasklet_disable_nosync( ) or
tasklet_disable( ). Both functions
increase the count field of the
tasklet descriptor, but the latter function does not return until an
already running instance of the tasklet function has terminated. To
reenable the tasklet, use tasklet_enable(
).
要激活该 tasklet,应根据所需的优先级调用 tasklet_schedule( ) 函数或 tasklet_hi_schedule( ) 函数。这两个函数非常相似;它们各自执行以下操作:
To activate the tasklet, you should invoke either the tasklet_schedule( ) function or the tasklet_hi_schedule( ) function, according
to the priority that you require for the tasklet. The two functions
are very similar; each of them performs the following actions:
检查TASKLET_STATE_SCHED标志;如果已设置,则返回(tasklet 已被调度)。
Checks the TASKLET_STATE_SCHED flag; if it is set,
returns (the tasklet has already been scheduled).
调用 local_irq_save 以保存 IF 标志的状态并禁用本地中断。
Invokes local_irq_save to
save the state of the IF flag
and to disable local interrupts.
将 tasklet 描述符添加到 tasklet_vec[n] 或 tasklet_hi_vec[n] 所指向的列表的开头,其中 n 表示本地 CPU 的逻辑编号。
Adds the tasklet descriptor at the beginning of the list
pointed to by tasklet_vec[n] or
tasklet_hi_vec[n], where
n denotes the logical number of
the local CPU.
调用 raise_softirq_irqoff( ) 以激活 TASKLET_SOFTIRQ 或 HI_SOFTIRQ 软中断(此函数与 raise_softirq( ) 类似,但它假定本地中断已被禁用)。
Invokes raise_softirq_irqoff(
) to activate either the TASKLET_SOFTIRQ or the HI_SOFTIRQ softirq (this function is
similar to raise_softirq( ),
except that it assumes that local interrupts are already
disabled).
调用 local_irq_restore 以恢复 IF 标志的状态。
Invokes local_irq_restore
to restore the state of the IF
flag.
最后我们看看 tasklet 是如何执行的。从上一节我们知道,一旦被激活,软中断函数就会由 do_softirq( ) 函数执行。与 HI_SOFTIRQ 软中断关联的软中断函数名为 tasklet_hi_action( ),而与 TASKLET_SOFTIRQ 关联的函数名为 tasklet_action( )。这两个函数再次非常相似;它们各自:
Finally, let's see how the tasklet is executed. We know from the
previous section that, once activated, softirq functions are executed
by the do_softirq( ) function. The
softirq function associated with the HI_SOFTIRQ softirq is named tasklet_hi_action( ), while the function
associated with TASKLET_SOFTIRQ is
named tasklet_action( ). Once
again, the two functions are very similar; each of them:
禁用本地中断。
Disables local interrupts.
获取本地 CPU 的逻辑编号 n。
Gets the logical number n
of the local CPU.
将 tasklet_vec[n] 或 tasklet_hi_vec[n] 所指向的列表的地址存储在 list 局部变量中。
Stores the address of the list pointed to by tasklet_vec[n] or tasklet_hi_vec[n] in the list local variable.
将 NULL 地址放入 tasklet_vec[n] 或 tasklet_hi_vec[n] 中,从而清空已调度的 tasklet 描述符列表。
Puts a NULL address in
tasklet_vec[n] or tasklet_hi_vec[n], thus emptying the
list of scheduled tasklet descriptors.
启用本地中断。
Enables local interrupts.
对于 list 所指向的列表中的每个 tasklet 描述符:
在多处理器系统中,检查该 tasklet 的 TASKLET_STATE_RUN 标志。
如果该标志被置位,说明同类型的一个 tasklet 已经在另一个 CPU 上运行,因此该函数将 tasklet 描述符重新插入到 tasklet_vec[n] 或 tasklet_hi_vec[n] 所指向的列表中,并再次激活 TASKLET_SOFTIRQ 或 HI_SOFTIRQ 软中断。这样,tasklet 的执行就被推迟,直到没有其他同类型的 tasklet 在其他 CPU 上运行。
否则,该 tasklet 没有在另一个 CPU 上运行:设置该标志,使该 tasklet 函数不能在其他 CPU 上执行。
通过查看 tasklet 描述符的 count 字段来检查该 tasklet 是否被禁用。如果该 tasklet 被禁用,它会清除其 TASKLET_STATE_RUN 标志,并将 tasklet 描述符重新插入到 tasklet_vec[n] 或 tasklet_hi_vec[n] 所指向的列表中;然后该函数再次激活 TASKLET_SOFTIRQ 或 HI_SOFTIRQ 软中断。
如果该 tasklet 已启用,它会清除 TASKLET_STATE_SCHED 标志并执行该 tasklet 函数。
For each tasklet descriptor in the list pointed to by
list:
In multiprocessor systems, checks the TASKLET_STATE_RUN flag of the
tasklet.
If it is set, a tasklet of the same type is already
running on another CPU, so the function reinserts the task
descriptor in the list pointed to by tasklet_vec[n] or tasklet_hi_vec[n] and activates
the TASKLET_SOFTIRQ or
HI_SOFTIRQ softirq
again. In this way, execution of the tasklet is deferred
until no other tasklets of the same type are running on
other CPUs.
Otherwise, the tasklet is not running on another CPU: sets the flag so that the tasklet function cannot be executed on other CPUs.
Checks whether the tasklet is disabled by looking at the
count field of the tasklet
descriptor. If the tasklet is disabled, it clears its TASKLET_STATE_RUN flag and reinserts
the task descriptor in the list pointed to by tasklet_vec[n] or tasklet_hi_vec[n]; then the function
activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again.
If the tasklet is enabled, it clears the TASKLET_STATE_SCHED flag and
executes the tasklet function.
请注意,除非 tasklet 函数重新激活自身,否则每次 tasklet 激活最多触发该 tasklet 函数的一次执行。
Notice that, unless the tasklet function reactivates itself, every tasklet activation triggers at most one execution of the tasklet function.
工作队列已在 Linux 2.6 中引入,取代了 Linux 2.4 中使用的称为"任务队列"的类似结构。它们允许内核函数被激活(很像可延迟函数),并随后由称为工作线程(worker thread)的特殊内核线程执行。
The work queues have been introduced in Linux 2.6 and replace a similar construct called "task queue" used in Linux 2.4. They allow kernel functions to be activated (much like deferrable functions) and later executed by special kernel threads called worker threads .
尽管有相似之处,可延迟函数和工作队列还是有很大不同的。主要区别在于可延迟函数在中断上下文中运行,而工作队列中的函数在进程上下文中运行。在进程上下文中运行是执行可能阻塞的函数(例如,需要访问磁盘上某些数据块的函数)的唯一方法,因为正如在“异常和中断处理程序的嵌套执行”部分中已经观察到的那样在本章前面,在中断上下文中不能发生进程切换。可延迟函数和工作队列中的函数都不能访问进程的用户模式地址空间。事实上,可延迟函数不能对以下进程做出任何假设:执行时当前正在运行,而工作队列中的函数是由内核线程执行的,因此没有用户态地址空间可访问。
Despite their similarities, deferrable functions and work queues are quite different. The main difference is that deferrable functions run in interrupt context while functions in work queues run in process context. Running in process context is the only way to execute functions that can block (for instance, functions that need to access some block of data on disk) because, as already observed in the section "Nested Execution of Exception and Interrupt Handlers" earlier in this chapter, no process switch can take place in interrupt context. Neither deferrable functions nor functions in a work queue can access the User Mode address space of a process. In fact, a deferrable function cannot make any assumption about the process that is currently running when it is executed. On the other hand, a function in a work queue is executed by a kernel thread, so there is no User Mode address space to access.
与工作队列相关的主要数据结构是一个名为 workqueue_struct 的描述符,其中包含一个有 NR_CPUS 个元素的数组,NR_CPUS 是系统中 CPU 的最大数量。[ * ] 每个元素都是一个 cpu_workqueue_struct 类型的描述符,其字段如表 4-12 所示。
The main data structure associated with a work queue is a
descriptor called workqueue_struct, which contains, among
other things, an array of NR_CPUS
elements, the maximum number of CPUs in the system.[*] Each element is a descriptor of type cpu_workqueue_struct, whose fields are
shown in Table
4-12.
表 4-12。cpu_workqueue_struct 结构体的字段
Table 4-12. The fields of the cpu_workqueue_struct structure
| 字段名称 Field name | 描述 Description |
|---|---|
| lock | 用于保护该结构的自旋锁 Spin lock used to protect the structure |
| remove_sequence | flush_workqueue( ) 使用的序列号 Sequence number used by flush_workqueue( ) |
| insert_sequence | flush_workqueue( ) 使用的序列号 Sequence number used by flush_workqueue( ) |
| worklist | 待处理函数列表的头部 Head of the list of pending functions |
| more_work | 等待队列,等待更多工作的工作线程在其中睡眠 Wait queue where the worker thread waiting for more work to be done sleeps |
| work_done | 等待队列,等待工作队列被刷新的进程在其中睡眠 Wait queue where the processes waiting for the work queue to be flushed sleep |
| wq | 指向包含该描述符的 workqueue_struct 结构的指针 Pointer to the workqueue_struct structure containing this descriptor |
| thread | 该结构的工作线程的进程描述符指针 Process descriptor pointer of the worker thread of the structure |
| run_depth | run_workqueue( ) 的当前执行深度(当工作队列列表中的函数阻塞时,该字段可能大于 1) Current execution depth of run_workqueue( ) (this field may become greater than one when a function in the work queue list blocks) |
cpu_workqueue_struct 结构的 worklist 字段是一个双向链表的头,该链表收集工作队列的待处理函数。每个待处理函数都由一个 work_struct 数据结构表示,其字段如表 4-13 所示。
The worklist field of the
cpu_workqueue_struct structure is
the head of a doubly linked list collecting the pending functions of
the work queue. Every pending function is represented by a work_struct data structure, whose fields
are shown in Table
4-13.
表 4-13。work_struct 结构体的字段
Table 4-13. The fields of the work_struct structure
| 字段名称 Field name | 描述 Description |
|---|---|
| pending | 如果该函数已在工作队列列表中则为 1,否则为 0 Set to 1 if the function is already in a work queue list, 0 otherwise |
| entry | 指向待处理函数列表中下一个和上一个元素的指针 Pointers to next and previous elements in the list of pending functions |
| func | 待处理函数的地址 Address of the pending function |
| data | 作为参数传递给待处理函数的指针 Pointer passed as a parameter to the pending function |
| wq_data | 通常指向父 cpu_workqueue_struct 描述符 Usually points to the parent cpu_workqueue_struct descriptor |
| timer | 用于延迟执行待处理函数的软件定时器 Software timer used to delay the execution of the pending function |
create_workqueue("foo") 函数接收一个字符串作为参数,并返回新创建的工作队列的 workqueue_struct 描述符的地址。该函数还创建 n 个工作线程(其中 n 是系统中实际存在的 CPU 数量),以传递给函数的字符串命名:foo/0、foo/1 等等。create_singlethread_workqueue( ) 函数与之类似,但无论系统中有多少个 CPU,它都只创建一个工作线程。要销毁工作队列,内核调用 destroy_workqueue( ) 函数,该函数接收一个指向 workqueue_struct 数组的指针作为参数。
The create_workqueue("foo"
) function receives as its parameter a string of
characters and returns the address of a workqueue_struct descriptor for the newly
created work queue. The function also creates n
worker threads (where n is the number of CPUs
effectively present in the system), named after the string passed to
the function: foo/0,
foo/1, and so on. The create_singlethread_workqueue( ) function
is similar, but it creates just one worker thread, no matter what
the number of CPUs in the system is. To destroy a work queue the
kernel invokes the destroy_workqueue(
) function, which receives as its parameter a pointer to a
workqueue_struct array.
queue_work( ) 在工作队列中插入一个函数(已经封装在 work_struct 描述符中);它接收一个指向 workqueue_struct 描述符的指针 wq 和一个指向 work_struct 描述符的指针 work。queue_work( ) 本质上执行以下步骤:
queue_work( ) inserts a
function (already packaged inside a work_struct descriptor) in a work queue;
it receives a pointer wq to the
workqueue_struct descriptor and a
pointer work to the work_struct descriptor. queue_work( ) essentially performs the
following steps:
检查要插入的函数是否已存在于工作队列中(work->pending字段等于1);如果是,则终止。
Checks whether the function to be inserted is already
present in the work queue (work->pending field equal to 1); if
so, terminates.
将 work_struct 描述符添加到工作队列列表中,并将 work->pending 设置为 1。
Adds the work_struct
descriptor to the work queue list, and sets work->pending to 1.
如果有工作线程正在本地 CPU 的 cpu_workqueue_struct 描述符的 more_work 等待队列中休眠,则该函数将其唤醒。
If a worker thread is sleeping in the more_work wait queue of the local
CPU's cpu_workqueue_struct
descriptor, the function wakes it up.
queue_delayed_work( ) 函数与 queue_work( ) 几乎相同,只是它接收第三个参数,表示以系统节拍计的时间延迟(请参阅第 6 章)。它用于确保待处理函数在执行前有一个最小延迟。实际上,queue_delayed_work( ) 依靠 work_struct 描述符的 timer 字段中的软件定时器,来推迟 work_struct 描述符实际插入工作队列列表的时间。cancel_delayed_work( ) 取消先前调度的工作队列函数,前提是相应的 work_struct 描述符尚未插入工作队列列表。
The queue_delayed_work( )
function is nearly identical to queue_work(
), except that it receives a third parameter representing
a time delay in system ticks (see Chapter 6). It is used to ensure
a minimum delay before the execution of the pending function. In
practice, queue_delayed_work( )
relies on the software timer in the timer field of the work_struct descriptor to defer the actual
insertion of the work_struct
descriptor in the work queue list. cancel_delayed_work( ) cancels a
previously scheduled work queue function, provided that the
corresponding work_struct
descriptor has not already been inserted in the work queue
list.
每个工作线程在 worker_thread( ) 函数内部持续执行循环;大多数时候,线程处于睡眠状态,等待有工作排入队列。一旦被唤醒,工作线程就调用 run_workqueue( ) 函数,该函数实质上从工作线程的工作队列列表中删除每个 work_struct 描述符,并执行相应的待处理函数。由于工作队列函数可能阻塞,工作线程可以进入睡眠状态,甚至在恢复时迁移到另一个 CPU。[ * ]
Every worker thread continuously executes a loop inside the
worker_thread( ) function; most
of the time the thread is sleeping and waiting for some work to be
queued. Once awakened, the worker thread invokes the run_workqueue( ) function, which
essentially removes every work_struct descriptor from the work queue
list of the worker thread and executes the corresponding pending
function. Because work queue functions can block, the worker thread
can be put to sleep and even migrated to another CPU when
resumed.[*]
有时,内核必须等待工作队列中的所有待处理函数都已执行完毕。flush_workqueue( ) 函数接收一个 workqueue_struct 描述符地址,并阻塞调用进程,直到工作队列中所有待处理的函数终止。但是,该函数不会等待在 flush_workqueue( ) 调用之后才加入工作队列的待处理函数;每个 cpu_workqueue_struct 描述符的 remove_sequence 和 insert_sequence 字段用于识别新添加的待处理函数。
Sometimes the kernel has to wait until all pending functions
in a work queue have been executed. The flush_workqueue( ) function receives a
workqueue_struct descriptor
address and blocks the calling process until all functions that are
pending in the work queue terminate. The function, however, does not
wait for any pending function that was added to the work queue
following flush_workqueue( )
invocation; the remove_sequence
and insert_sequence fields of
every cpu_workqueue_struct
descriptor are used to recognize the newly added pending
functions.
在大多数情况下,创建一整套工作线程来运行一个函数是多余的。因此,内核提供了一个名为events 的预定义工作队列,每个内核开发人员都可以自由使用它。预定义的工作队列只不过是一个标准的工作队列,可能包括不同内核层和I/O驱动程序的功能;它的workqueue_struct描述符存储在keventd_wq数组中。为了使用预定义的工作队列,内核提供了表 4-14中列出的函数。
In most cases, creating a whole set of worker threads
in order to run a function is overkill. Therefore, the kernel offers
a predefined work queue called events, which
can be freely used by every kernel developer. The predefined work
queue is nothing more than a standard work queue that may include
functions of different kernel layers and I/O drivers; its workqueue_struct descriptor is stored in
the keventd_wq array. To make use
of the predefined work queue, the kernel offers the functions listed
in Table
4-14.
表 4-14。预定义工作队列的辅助函数
Table 4-14. Helper functions for the predefined work queue
| 预定义工作队列函数 Predefined work queue function | 等效标准工作队列函数 Equivalent standard work queue function |
|---|---|
| schedule_work(w) | queue_work(keventd_wq,w) |
| schedule_delayed_work(w,d) | queue_delayed_work(keventd_wq,w,d)(在任意 CPU 上 on any CPU) |
| schedule_delayed_work_on(cpu,w,d) | queue_delayed_work(keventd_wq,w,d)(在给定 CPU 上 on a given CPU) |
| flush_scheduled_work( ) | flush_workqueue(keventd_wq) |
当函数很少被调用时,预定义的工作队列可以节省大量的系统资源。另一方面,在预定义工作队列中执行的函数不应该长时间阻塞:因为工作队列列表中待处理函数的执行在每个CPU上都是串行的,所以长时间的延迟会对预定义工作的其他用户产生负面影响队列。
The predefined work queue saves significant system resources when the function is seldom invoked. On the other hand, functions executed in the predefined work queue should not block for a long time: because the execution of the pending functions in the work queue list is serialized on each CPU, a long delay negatively affects the other users of the predefined work queue.
除了一般的事件队列之外,您还会在 Linux 2.6 中发现一些专门的工作队列。最重要的是块设备层使用的kblockd工作队列(参见第14章)。
In addition to the general events queue, you'll find a few specialized work queues in Linux 2.6. The most significant is the kblockd work queue used by the block device layer (see Chapter 14).
[ * ]在多处理器系统中复制工作队列数据结构的原因是每个CPU的本地数据结构产生更高效的代码(参见第5章中的“每CPU变量”部分)。
[*] The reason for duplicating the work queue data structures in multiprocessor systems is that per-CPU local data structures yield a much more efficient code (see the section "Per-CPU Variables" in Chapter 5).
[ * ] 奇怪的是,一个工作线程可以由任意 CPU 执行,而不仅仅是该工作线程所属的 cpu_workqueue_struct 描述符对应的 CPU。因此,queue_work( ) 在本地 CPU 的队列中插入一个函数,但该函数可以由系统中的任何 CPU 执行。
[*] Strangely enough, a worker thread can be executed by every
CPU, not just the CPU corresponding to the cpu_workqueue_struct descriptor to
which the worker thread belongs. Therefore, queue_work( ) inserts a function in
the queue of the local CPU, but that function may be executed by
any CPU in the system.
我们将通过检查终止阶段来结束这一章 中断和异常处理程序。(从系统调用返回是一种特殊情况,我们将在第 10 章中描述它。)尽管主要目标很明确——即恢复某个程序的执行——但在执行此操作之前必须考虑几个问题:
We will finish the chapter by examining the termination phase of interrupt and exception handlers. (Returning from a system call is a special case, and we shall describe it in Chapter 10.) Although the main objective is clear — namely, to resume execution of some program — several issues must be considered before doing it:
如果并发执行的内核控制路径只有一条,CPU 必须切换回用户模式。
If there is just one kernel control path being concurrently executed, the CPU must switch back to User Mode.
如果有挂起的进程切换请求,内核必须执行进程调度;否则,控制权返回给当前进程。
If there is any pending process switch request, the kernel must perform process scheduling; otherwise, control is returned to the current process.
如果一个信号被发送到当前进程,则必须对其进行处理。
If a signal is sent to the current process, it must be handled.
如果调试器正在跟踪当前进程的执行,则在切换回用户模式之前必须恢复单步模式。
If a debugger is tracing the execution of the current process, single-step mode must be restored before switching back to User Mode.
如果CPU处于虚拟8086模式,则当前进程正在执行传统实模式程序,因此必须以特殊方式处理。
If the CPU is in virtual-8086 mode, the current process is executing a legacy Real Mode program, thus it must be handled in a special way.
一些标志用于跟踪待处理的进程切换请求、待处理的信号以及单步执行;它们存储在 thread_info 描述符的 flags 字段中。该字段还存储其他标志,但它们与从中断和异常返回无关。有关这些标志的完整列表,请参见表 4-15。
A few flags are used to keep track of pending process switch
requests, of pending signals , and of single step execution; they are stored in the
flags field of the thread_info descriptor. The field stores other
flags as well, but they are not related to returning from interrupts and
exceptions. See Table
4-15 for a complete list of these flags.
表 4-15。thread_info 描述符的 flags 字段
Table 4-15. The flags field of the thread_info descriptor
| 标志名称 Flag name | 描述 Description |
|---|---|
| TIF_SYSCALL_TRACE | 正在跟踪系统调用 System calls are being traced |
| TIF_NOTIFY_RESUME | 不用于 80x86 平台 Not used in the 80x86 platform |
| TIF_SIGPENDING | 进程有未决信号 The process has pending signals |
| TIF_NEED_RESCHED | 必须执行调度 Scheduling must be performed |
| TIF_SINGLESTEP | 返回用户模式时恢复单步执行 Restore single step execution on return to User Mode |
| TIF_IRET | 强制通过 iret 而不是 sysexit 从系统调用返回 Force return from system call via iret rather than sysexit |
| TIF_SYSCALL_AUDIT | 正在审核系统调用 System calls are being audited |
| TIF_POLLING_NRFLAG | 空闲进程正在轮询 TIF_NEED_RESCHED 标志 The idle process is polling the TIF_NEED_RESCHED flag |
| TIF_MEMDIE | 正在销毁该进程以回收内存(请参阅第 17 章中的"内存不足杀手"部分) The process is being destroyed to reclaim memory (see the section "The Out of Memory Killer" in Chapter 17) |
从技术上讲,完成所有这些事情的内核汇编语言代码并不是一个函数,因为控制权永远不会返回到调用它的函数。它是一段具有两个不同入口点的代码:ret_from_intr( ) 和 ret_from_exception( )。顾名思义,内核在终止中断处理程序时进入前者,在终止异常处理程序时进入后者。我们将把这两个入口点称为函数,因为这使描述更简单。
The kernel assembly language code that accomplishes all these
things is not, technically speaking, a function, because control is
never returned to the functions that invoke it. It is a piece of code
with two different entry points: ret_from_intr(
) and ret_from_exception(
). As their names suggest, the kernel enters the former when
terminating an interrupt handler, and it enters the latter when
terminating an exception handler. We shall refer to the two entry points
as functions, because this makes the description simpler.
图 4-6 展示了带有相应两个入口点的总体流程图。灰色框指的是实现内核抢占的汇编语言指令(见第 5 章);如果你想看看内核在不支持内核抢占的情况下编译时会做什么,只需忽略灰色框即可。在流程图中,ret_from_exception( ) 和 ret_from_intr( ) 两个入口点看起来非常相似。仅当内核抢占作为编译选项被选中时才存在差异:在这种情况下,从异常返回时会立即禁用本地中断。
The general flow diagram with the corresponding two entry points
is illustrated in Figure
4-6. The gray boxes refer to assembly language instructions that
implement kernel preemption (see Chapter 5); if you want to see what
the kernel does when it is compiled without support for kernel
preemption, just ignore the gray boxes. The ret_from_exception( ) and ret_from_intr( ) entry points look quite
similar in the flow diagram. A difference exists only if support for
kernel preemption has been selected as a compilation option: in this
case, local interrupts are immediately disabled when returning from
exceptions.
该流程图给出了恢复执行被中断的程序所需的步骤的粗略概念。现在我们将通过讨论汇编语言代码来详细讨论。
The flow diagram gives a rough idea of the steps required to resume the execution of an interrupted program. Now we will go into detail by discussing the assembly language code.
ret_from_intr( ) 和 ret_from_exception( ) 入口点本质上等同于以下汇编语言代码:
The ret_from_intr( ) and
ret_from_exception( ) entry
points are essentially equivalent to the following assembly language
code:
ret_from_exception:
cli ; missing if kernel preemption is not supported
ret_from_intr:
movl $-8192, %ebp ; -4096 if multiple Kernel Mode stacks are used
andl %esp, %ebp
movl 0x30(%esp), %eax
movb 0x2c(%esp), %al
testl $0x00020003, %eax
jnz resume_userspace
jmp resume_kernel

回想一下,当从中断返回时,本地中断是被禁用的(参见前面对 handle_IRQ_event( ) 的描述中的步骤 3);因此,cli 汇编语言指令仅在从异常返回时才执行。
Recall that when returning from an interrupt, the local
interrupts are disabled (see step 3 in the earlier description of
handle_IRQ_event( )); thus, the
cli assembly language instruction is executed only when
returning from an exception.
内核将 current 的 thread_info 描述符的地址加载到 ebp 寄存器中(参见第 3 章中的"识别进程")。
The kernel loads the address of the thread_info descriptor of current in the ebp register (see "Identifying a Process"
in Chapter 3).
接下来,使用中断或异常发生时被压入堆栈的 cs 和 eflags 寄存器的值,来确定被中断的程序是否运行在用户模式,或者 eflags 的 VM 标志是否被设置。[ * ] 无论哪种情况,都跳转到 resume_userspace 标签。否则,跳转到 resume_kernel 标签。
Next, the values of the cs
and eflags registers, which were pushed on the stack when the
interrupt or the exception occurred, are used to determine whether
the interrupted program was running in User Mode, or if the VM flag of eflags was set.[*] In either case, a jump is made to the resume_userspace label. Otherwise, a jump
is made to the resume_kernel
label.
如果要恢复的程序运行在内核态,则执行 resume_kernel 标号处的汇编语言代码:
The assembly language code at the resume_kernel label is executed if the
program to be resumed is running in Kernel Mode:
resume_kernel:
cli ; these three instructions are
cmpl $0, 0x14(%ebp) ; missing if kernel preemption
jz need_resched ; is not supported
restore_all:
popl %ebx
popl %ecx
popl %edx
popl %esi
popl %edi
popl %ebp
popl %eax
popl %ds
popl %es
addl $4, %esp
iret

如果 thread_info 描述符的 preempt_count 字段为零(内核抢占已启用),内核跳转到 need_resched 标签。否则,将重新启动被中断的程序:该代码把中断或异常开始时保存的值重新加载到寄存器中,并通过执行 iret 指令交出控制权。
If the preempt_count field
of the thread_info descriptor is
zero (kernel preemption enabled), the kernel jumps to the need_resched label. Otherwise, the
interrupted program is to be restarted. The function loads the
registers with the values saved when the interrupt or the exception
started, and the function yields control by executing the iret instruction.
当执行此代码时,所有未完成的内核控制路径都不是中断处理程序,否则 preempt_count 字段将大于零。然而,正如本章前面"异常和中断处理程序的嵌套执行"一节所述,除了正在终止的路径之外,最多还可能有两个与异常相关的内核控制路径。
When this code is executed, none of the unfinished kernel
control paths is an interrupt handler, otherwise the preempt_count field would be greater than
zero. However, as stated in "Nested Execution of Exception
and Interrupt Handlers" earlier in this chapter, there could
be up to two kernel control paths associated with exceptions (beside
the one that is terminating).
need_resched:
movl 0x8(%ebp), %ecx
testb $(1<<TIF_NEED_RESCHED), %cl
jz restore_all
testl $0x00000200,0x30(%esp)
jz restore_all
call preempt_schedule_irq
jmp need_resched

如果 current->thread_info 的 flags 字段中的 TIF_NEED_RESCHED 标志为零,则不需要进程切换,因此跳转到 restore_all 标签。如果正在恢复的内核控制路径是在禁用本地中断的情况下运行的,同样跳转到该标签。在这种情况下,进程切换可能会损坏内核数据结构(更多细节请参阅第 5 章中的"何时需要同步"一节)。
If the TIF_NEED_RESCHED
flag in the flags field of
current->thread_info is zero,
no process switch is required, thus a jump is made to the restore_all label. Also a jump to the same
label is made if the kernel control path that is being resumed was
running with the local interrupts disabled. In this case a process
switch could corrupt kernel data structures (see the section "When Synchronization Is
Necessary" in Chapter
5 for more details).
如果需要进程切换,则调用 preempt_schedule_irq( ) 函数:它设置 preempt_count 字段中的 PREEMPT_ACTIVE 标志,暂时把大内核锁计数器设置为 -1(参见第 5 章中的"大内核锁"一节),启用本地中断,并调用 schedule( ) 选择另一个进程来运行。当先前的进程恢复时,preempt_schedule_irq( ) 恢复大内核锁计数器的先前值,清除 PREEMPT_ACTIVE 标志,并禁用本地中断。只要当前进程的 TIF_NEED_RESCHED 标志被设置,schedule( ) 函数就会继续被调用。
If a process switch is required, the preempt_schedule_irq( ) function is
invoked: it sets the PREEMPT_ACTIVE flag in the preempt_count field, temporarily sets the
big kernel lock counter to -1
(see the section "The
Big Kernel Lock" in Chapter 5), enables the local
interrupts, and invokes schedule(
) to select another process to run. When the former
process resumes, preempt_schedule_irq(
) restores the previous value of the big kernel lock
counter, clears the PREEMPT_ACTIVE flag, and disables local
interrupts. The schedule( )
function will continue to be invoked as long as the TIF_NEED_RESCHED flag of the current
process is set.
If the program to be resumed was running in User Mode,
a jump is made to the resume_userspace label:
resume_userspace:
cli
movl 0x8(%ebp), %ecx
andl $0x0000ff6e, %ecx
je restore_all
jmp work_pending
After disabling the local interrupts, a check is made on the
value of the flags field of
current->thread_info. If no
flag except TIF_SYSCALL_TRACE,
TIF_SYSCALL_AUDIT, or TIF_SINGLESTEP is set, nothing remains to
be done: a jump is made to the restore_all label, thus resuming the User
Mode program.
The flags in the thread_info descriptor state that
additional work is required before resuming the interrupted
program.
work_pending:
testb $(1<<TIF_NEED_RESCHED), %cl
jz work_notifysig
work_resched:
call schedule
cli
jmp resume_userspace
If a process switch request is pending, schedule( ) is invoked to select another
process to run. When the former process resumes, a jump is made
back to resume_userspace.
There is other work to be done besides process switch requests:
work_notifysig:
movl %esp, %eax
testl $0x00020000, 0x30(%esp)
je 1f
work_notifysig_v86:
pushl %ecx
call save_v86_state
popl %ecx
movl %eax, %esp
1:
xorl %edx, %edx
call do_notify_resume
jmp restore_all
If the VM control flag in
the eflags register of the User Mode program is set, the
save_v86_state( ) function is
invoked to build up the virtual-8086 mode data structures in the
User Mode address space. Then the do_notify_resume( ) function is invoked to
take care of pending signals and single stepping. Finally, a jump is
made to the restore_all label to
resume the interrupted program.
You could think of the kernel as a server that answers requests; these requests can come either from a process running on a CPU or an external device issuing an interrupt request. We make this analogy to underscore that parts of the kernel are not run serially, but in an interleaved way. Thus, they can give rise to race conditions, which must be controlled through proper synchronization techniques. A general introduction to these topics can be found in the section "An Overview of Unix Kernels" in Chapter 1.
We start this chapter by reviewing when, and to what extent, kernel requests are executed in an interleaved fashion. We then introduce the basic synchronization primitives implemented by the kernel and describe how they are applied in the most common cases. Finally, we illustrate a few practical examples.
To get a better grasp of how the kernel's code is executed, we will look at the kernel as a waiter who must satisfy two types of requests: those issued by customers and those issued by a limited number of different bosses. The policy adopted by the waiter is the following:
If a boss calls while the waiter is idle, the waiter starts servicing the boss.
If a boss calls while the waiter is servicing a customer, the waiter stops servicing the customer and starts servicing the boss.
If a boss calls while the waiter is servicing another boss, the waiter stops servicing the first boss and starts servicing the second one. When he finishes servicing the new boss, he resumes servicing the former one.
One of the bosses may induce the waiter to leave the customer being currently serviced. After servicing the last request of the bosses, the waiter may decide to drop his customer temporarily and pick up a new one.
The services performed by the waiter correspond to the code executed when the CPU is in Kernel Mode. If the CPU is executing in User Mode, the waiter is considered idle.
Boss requests correspond to interrupts, while customer requests
correspond to system calls or exceptions raised by User Mode processes.
As we shall see in detail in Chapter
10, User Mode processes that want to request a service from the
kernel must issue an appropriate instruction (on the 80×86, an int $0x80 or a sysenter instruction). Such instructions raise an exception that
forces the CPU to switch from User Mode to Kernel Mode. In the rest of
this chapter, we will generally denote as "exceptions" both the system
calls and the usual exceptions.
The careful reader has already associated the first three rules with the nesting of kernel control paths described in "Nested Execution of Exception and Interrupt Handlers" in Chapter 4. The fourth rule corresponds to one of the most interesting new features included in the Linux 2.6 kernel, namely kernel preemption .
It is surprisingly hard to give a good definition of kernel preemption. As a first try, we could say that a kernel is preemptive if a process switch may occur while the replaced process is executing a kernel function, that is, while it runs in Kernel Mode. Unfortunately, in Linux (as well as in any other real operating system) things are much more complicated:
Both in preemptive and nonpreemptive kernels, a process running in Kernel Mode can voluntarily relinquish the CPU, for instance because it has to sleep waiting for some resource. We will call this kind of process switch a planned process switch. However, a preemptive kernel differs from a nonpreemptive kernel on the way a process running in Kernel Mode reacts to asynchronous events that could induce a process switch—for instance, an interrupt handler that awakes a higher priority process. We will call this kind of process switch a forced process switch.
All process switches are performed by the switch_to macro. In both preemptive and
nonpreemptive kernels, a process switch occurs when a process has
finished some thread of kernel activity and the scheduler is
invoked. However, in nonpreemptive kernels, the current process
cannot be replaced unless it is about to switch to User Mode (see
the section "Performing the Process
Switch" in Chapter
3).
Therefore, the main characteristic of a preemptive kernel is that a process running in Kernel Mode can be replaced by another process while in the middle of a kernel function.
Let's give a couple of examples to illustrate the difference between preemptive and nonpreemptive kernels.
While process A executes an exception handler (necessarily in Kernel Mode), a higher priority process B becomes runnable. This could happen, for instance, if an IRQ occurs and the corresponding handler awakens process B. If the kernel is preemptive, a forced process switch replaces process A with B. The exception handler is left unfinished and will be resumed only when the scheduler selects process A again for execution. Conversely, if the kernel is nonpreemptive, no process switch occurs until process A either finishes handling the exception handler or voluntarily relinquishes the CPU.
For another example, consider a process that executes an exception handler and whose time quantum expires (see the section "The scheduler_tick( ) Function" in Chapter 7). If the kernel is preemptive, the process may be replaced immediately; however, if the kernel is nonpreemptive, the process continues to run until it finishes handling the exception handler or voluntarily relinquishes the CPU.
The main motivation for making a kernel preemptive is to reduce the dispatch latency of the User Mode processes, that is, the delay between the time they become runnable and the time they actually begin running. Processes performing timely scheduled tasks (such as external hardware controllers, environmental monitors, movie players, and so on) really benefit from kernel preemption, because it reduces the risk of being delayed by another process running in Kernel Mode.
Making the Linux 2.6 kernel preemptive did not require a drastic
change in the kernel design with respect to the older nonpreemptive
kernel versions. As described in the section "Returning from Interrupts and
Exceptions" in Chapter
4, kernel preemption is disabled when the preempt_count field in the thread_info descriptor referenced by the
current_thread_info( ) macro is
greater than zero. The field encodes three different counters, as
shown in Table 4-10
in Chapter 4, so it is
greater than zero when any of the following cases occurs:
The kernel is executing an interrupt service routine.
The deferrable functions are disabled (always true when the kernel is executing a softirq or tasklet).
The kernel preemption has been explicitly disabled by setting the preemption counter to a positive value.
The above rules tell us that the kernel can be preempted only when it is executing an exception handler (in particular a system call) and the kernel preemption has not been explicitly disabled. Furthermore, as described in the section "Returning from Interrupts and Exceptions" in Chapter 4, the local CPU must have local interrupts enabled, otherwise kernel preemption is not performed.
A few simple macros listed in Table 5-1 deal with the
preemption counter in the preempt_count field.
Table 5-1. Macros dealing with the preemption counter subfield
| Macro | Description |
|---|---|
| preempt_count( ) | Selects the preempt_count field in the thread_info descriptor |
| preempt_disable( ) | Increases by one the value of the preemption counter |
| preempt_enable_no_resched( ) | Decreases by one the value of the preemption counter |
| preempt_enable( ) | Decreases by one the value of the preemption counter, and invokes preempt_schedule( ) if the TIF_NEED_RESCHED flag in the thread_info descriptor is set |
| get_cpu( ) | Similar to preempt_disable( ), but also returns the number of the local CPU |
| put_cpu( ) | Same as preempt_enable( ) |
| put_cpu_no_resched( ) | Same as preempt_enable_no_resched( ) |
The preempt_enable( ) macro
decreases the preemption counter, then checks whether the TIF_NEED_RESCHED flag is set (see Table 4-15 in Chapter 4). In this case, a
process switch request is pending, so the macro invokes the preempt_schedule( ) function, which
essentially executes the following code:
if (!current_thread_info->preempt_count && !irqs_disabled()) {
current_thread_info->preempt_count = PREEMPT_ACTIVE;
schedule();
current_thread_info->preempt_count = 0;
}
The function checks whether local interrupts are enabled and the
preempt_count field of current is zero; if both conditions are
true, it invokes schedule( ) to
select another process to run. Therefore, kernel preemption may happen
either when a kernel control path (usually, an interrupt handler) is
terminated, or when an exception handler reenables kernel preemption
by means of preempt_enable( ). As
we'll see in the section "Disabling and Enabling
Deferrable Functions" later in this chapter, kernel preemption
may also happen when deferrable functions are enabled.
We'll conclude this section by noticing that kernel preemption introduces a nonnegligible overhead. For that reason, Linux 2.6 features a kernel configuration option that allows users to enable or disable kernel preemption when compiling the kernel.
Chapter 1 introduced the concepts of race condition and critical region for processes. The same definitions apply to kernel control paths . In this chapter, a race condition can occur when the outcome of a computation depends on how two or more interleaved kernel control paths are nested. A critical region is a section of code that must be completely executed by the kernel control path that enters it before another kernel control path can enter it.
Interleaving kernel control paths complicates the life of kernel developers: they must apply special care in order to identify the critical regions in exception handlers, interrupt handlers, deferrable functions, and kernel threads . Once a critical region has been identified, it must be suitably protected to ensure that any time at most one kernel control path is inside that region.
Suppose, for instance, that two different interrupt handlers need to access the same data structure that contains several related member variables — for instance, a buffer and an integer indicating its length. All statements affecting the data structure must be put into a single critical region. If the system includes a single CPU, the critical region can be implemented by disabling interrupts while accessing the shared data structure, because nesting of kernel control paths can only occur when interrupts are enabled.
On the other hand, if the same data structure is accessed only by the service routines of system calls, and if the system includes a single CPU, the critical region can be implemented quite simply by disabling kernel preemption while accessing the shared data structure.
As you would expect, things are more complicated in multiprocessor systems. Many CPUs may execute kernel code at the same time, so kernel developers cannot assume that a data structure can be safely accessed just because kernel preemption is disabled and the data structure is never addressed by an interrupt, exception, or softirq handler.
We'll see in the following sections that the kernel offers a wide range of different synchronization techniques. It is up to kernel designers to solve each synchronization problem by selecting the most efficient technique.
Some design choices already discussed in the previous chapter simplify somewhat the synchronization of kernel control paths. Let us recall them briefly:
All interrupt handlers acknowledge the interrupt on the PIC and also disable the IRQ line. Further occurrences of the same interrupt cannot occur until the handler terminates.
Interrupt handlers, softirqs, and tasklets are both nonpreemptable and non-blocking, so they cannot be suspended for a long time interval. In the worst case, their execution will be slightly delayed, because other interrupts occur during their execution (nested execution of kernel control paths).
A kernel control path performing interrupt handling cannot be interrupted by a kernel control path executing a deferrable function or a system call service routine.
Softirqs and tasklets cannot be interleaved on a given CPU.
The same tasklet cannot be executed simultaneously on several CPUs.
Each of the above design choices can be viewed as a constraint that can be exploited to code some kernel functions more easily. Here are a few examples of possible simplifications:
Interrupt handlers and tasklets need not be coded as reentrant functions.
Per-CPU variables accessed by softirqs and tasklets only do not require synchronization.
A data structure accessed by only one kind of tasklet does not require synchronization.
The rest of this chapter describes what to do when synchronization is necessary — i.e., how to prevent data corruption due to unsafe accesses to shared data structures.
We now examine how kernel control paths can be interleaved while avoiding race conditions among shared data. Table 5-2 lists the synchronization techniques used by the Linux kernel. The "Scope" column indicates whether the synchronization technique applies to all CPUs in the system or to a single CPU. For instance, local interrupt disabling applies to just one CPU (other CPUs in the system are not affected); conversely, an atomic operation affects all CPUs in the system (atomic operations on several CPUs cannot interleave while accessing the same data structure).
Table 5-2. Various types of synchronization techniques used by the kernel
| Technique | Description | Scope |
|---|---|---|
| Per-CPU variables | Duplicate a data structure among the CPUs | All CPUs |
| Atomic operation | Atomic read-modify-write instruction to a counter | All CPUs |
| Memory barrier | Avoid instruction reordering | Local CPU or All CPUs |
| Spin lock | Lock with busy wait | All CPUs |
| Semaphore | Lock with blocking wait (sleep) | All CPUs |
| Seqlocks | Lock based on an access counter | All CPUs |
| Local interrupt disabling | Forbid interrupt handling on a single CPU | Local CPU |
| Local softirq disabling | Forbid deferrable function handling on a single CPU | Local CPU |
| Read-copy-update (RCU) | Lock-free access to shared data structures through pointers | All CPUs |
Let's now briefly discuss each synchronization technique. In the later section "Synchronizing Accesses to Kernel Data Structures," we show how these synchronization techniques can be combined to effectively protect kernel data structures.
The best synchronization technique consists in designing the kernel so as to avoid the need for synchronization in the first place. As we'll see, in fact, every explicit synchronization primitive has a significant performance cost.
The simplest and most efficient synchronization technique consists of declaring kernel variables as per-CPU variables . Basically, a per-CPU variable is an array of data structures, one element per each CPU in the system.
A CPU should not access the elements of the array corresponding to the other CPUs; on the other hand, it can freely read and modify its own element without fear of race conditions, because it is the only CPU entitled to do so. This also means, however, that the per-CPU variables can be used only in particular cases—basically, when it makes sense to logically split the data across the CPUs of the system.
The elements of the per-CPU array are aligned in main memory so that each data structure falls on a different line of the hardware cache (see the section "Hardware Cache" in Chapter 2). Therefore, concurrent accesses to the per-CPU array do not result in cache line snooping and invalidation, which are costly operations in terms of system performance.
While per-CPU variables provide protection against concurrent accesses from several CPUs, they do not provide protection against accesses from asynchronous functions (interrupt handlers and deferrable functions). In these cases, additional synchronization primitives are required.
Furthermore, per-CPU variables are prone to race conditions caused by kernel preemption , both in uniprocessor and multiprocessor systems. As a general rule, a kernel control path should access a per-CPU variable with kernel preemption disabled. Just consider, for instance, what would happen if a kernel control path gets the address of its local copy of a per-CPU variable, and then it is preempted and moved to another CPU: the address still refers to the element of the previous CPU.
Table 5-3 lists the main functions and macros offered by the kernel to use per-CPU variables.
Table 5-3. Functions and macros for the per-CPU variables
| Macro or function name | Description |
|---|---|
| DEFINE_PER_CPU(type, name) | Statically allocates a per-CPU array called name of type data structures |
| per_cpu(name, cpu) | Selects the element for CPU cpu of the per-CPU array name |
| _ _get_cpu_var(name) | Selects the local CPU's element of the per-CPU array name |
| get_cpu_var(name) | Disables kernel preemption, then selects the local CPU's element of the per-CPU array name |
| put_cpu_var(name) | Enables kernel preemption (name is not used) |
| alloc_percpu(type) | Dynamically allocates a per-CPU array of type data structures and returns its address |
| free_percpu(pointer) | Releases a dynamically allocated per-CPU array at address pointer |
| per_cpu_ptr(pointer, cpu) | Returns the address of the element for CPU cpu of the per-CPU array at address pointer |
Several assembly language instructions are of type "read-modify-write" — that is, they access a memory location twice, the first time to read the old value and the second time to write a new value.
Suppose that two kernel control paths running on two CPUs try to "read-modify-write" the same memory location at the same time by executing nonatomic operations. At first, both CPUs try to read the same location, but the memory arbiter (a hardware circuit that serializes accesses to the RAM chips) steps in to grant access to one of them and delay the other. However, when the first read operation has completed, the delayed CPU reads exactly the same (old) value from the memory location. Both CPUs then try to write the same (new) value to the memory location; again, the bus memory access is serialized by the memory arbiter, and eventually both write operations succeed. However, the global result is incorrect because both CPUs write the same (new) value. Thus, the two interleaving "read-modify-write" operations act as a single one.
The easiest way to prevent race conditions due to "read-modify-write" instructions is by ensuring that such operations are atomic at the chip level. Every such operation must be executed in a single instruction without being interrupted in the middle and avoiding accesses to the same memory location by other CPUs. These very small atomic operations can be found at the base of other, more flexible mechanisms to create critical regions.
Let's review 80×86 instructions according to that classification:
Assembly language instructions that make zero or one aligned memory access are atomic.[*]
Read-modify-write assembly language instructions (such as
inc or dec) that read data from memory, update
it, and write the updated value back to memory are atomic if no
other processor has taken the memory bus after the read and before
the write. Memory bus stealing never happens in a uniprocessor
system.
Read-modify-write assembly language instructions whose
opcode is prefixed by the lock
byte (0xf0) are atomic even on
a multiprocessor system. When the control unit detects the prefix,
it "locks" the memory bus until the instruction is finished.
Therefore, other processors cannot access the memory location
while the locked instruction is being executed.
Assembly language instructions whose opcode is prefixed by a
rep byte (0xf2, 0xf3, which forces the control unit to
repeat the same instruction several times) are not atomic. The
control unit checks for pending interrupts before executing a new
iteration.
When you write C code, you cannot guarantee that the compiler
will use an atomic instruction for an operation like a=a+1 or even for a++. Thus, the Linux kernel provides a
special atomic_t type (an
atomically accessible counter) and some special functions and macros
(see Table 5-4) that
act on atomic_t variables and are
implemented as single, atomic assembly language instructions. On
multiprocessor systems, each such instruction is prefixed by a
lock byte.
Table 5-4. Atomic operations in Linux
| Function | Description |
|---|---|
| atomic_read(v) | Return *v |
| atomic_set(v,i) | Set *v to i |
| atomic_add(i,v) | Add i to *v |
| atomic_sub(i,v) | Subtract i from *v |
| atomic_sub_and_test(i,v) | Subtract i from *v and return 1 if the result is zero; 0 otherwise |
| atomic_inc(v) | Add 1 to *v |
| atomic_dec(v) | Subtract 1 from *v |
| atomic_dec_and_test(v) | Subtract 1 from *v and return 1 if the result is zero; 0 otherwise |
| atomic_inc_and_test(v) | Add 1 to *v and return 1 if the result is zero; 0 otherwise |
| atomic_add_negative(i,v) | Add i to *v and return 1 if the result is negative; 0 otherwise |
| atomic_inc_return(v) | Add 1 to *v and return the new value of *v |
| atomic_dec_return(v) | Subtract 1 from *v and return the new value of *v |
| atomic_add_return(i,v) | Add i to *v and return the new value of *v |
| atomic_sub_return(i,v) | Subtract i from *v and return the new value of *v |
Another class of atomic functions operate on bit masks (see Table 5-5).
Table 5-5. Atomic bit handling functions in Linux
| Function | Description |
|---|---|
| test_bit(nr, addr) | Return the value of the nrth bit of *addr |
| set_bit(nr, addr) | Set the nrth bit of *addr |
| clear_bit(nr, addr) | Clear the nrth bit of *addr |
| change_bit(nr, addr) | Invert the nrth bit of *addr |
| test_and_set_bit(nr, addr) | Set the nrth bit of *addr and return its old value |
| test_and_clear_bit(nr, addr) | Clear the nrth bit of *addr and return its old value |
| test_and_change_bit(nr, addr) | Invert the nrth bit of *addr and return its old value |
| atomic_clear_mask(mask, addr) | Clear all bits of *addr specified by mask |
| atomic_set_mask(mask, addr) | Set all bits of *addr specified by mask |
When using optimizing compilers, you should never take for granted that instructions will be performed in the exact order in which they appear in the source code. For example, a compiler might reorder the assembly language instructions in such a way to optimize how registers are used. Moreover, modern CPUs usually execute several instructions in parallel and might reorder memory accesses. These kinds of reordering can greatly speed up the program.
When dealing with synchronization, however, reordering instructions must be avoided. Things would quickly become hairy if an instruction placed after a synchronization primitive is executed before the synchronization primitive itself. Therefore, all synchronization primitives act as optimization and memory barriers .
An optimization barrier primitive ensures
that the assembly language instructions corresponding to C statements
placed before the primitive are not mixed by the compiler with
assembly language instructions corresponding to C statements placed
after the primitive. In Linux the barrier(
) macro, which expands into asm
volatile("":::"memory"), acts as an optimization barrier.
The asm instruction tells the
compiler to insert an assembly language fragment (empty, in this
case). The volatile keyword forbids
the compiler to reshuffle the asm
instruction with the other instructions of the program. The memory keyword forces the compiler to assume
that all memory locations in RAM have been changed by the assembly
language instruction; therefore, the compiler cannot optimize the code
by using the values of memory locations stored in CPU registers before
the asm instruction. Notice that
the optimization barrier does not ensure that the executions of the
assembly language instructions are not mixed by the CPU—this is a job
for a memory barrier.
A memory barrier primitive ensures that the operations placed before the primitive are finished before starting the operations placed after the primitive. Thus, a memory barrier is like a firewall that cannot be passed by an assembly language instruction.
In the 80×86 processors, the following kinds of assembly language instructions are said to be "serializing" because they act as memory barriers:
All instructions that operate on I/O ports
All instructions prefixed by the lock byte (see the section "Atomic
Operations")
All instructions that write into control registers, system
registers, or debug registers (for instance, cli and sti
, which change the status of the IF flag in the eflags register)
The lfence , sfence
, and mfence
assembly language instructions, which have been
introduced in the Pentium 4 microprocessor to efficiently
implement read memory barriers, write memory barriers, and
read-write memory barriers, respectively.
A few special assembly language instructions; among them,
the iret instruction that terminates an interrupt or
exception handler
Linux uses a few memory barrier primitives, which are shown in
Table 5-6. These
primitives act also as optimization barriers , because we must make sure the compiler does not move
the assembly language instructions around the barrier. "Read memory
barriers" act only on instructions that read from memory, while "write
memory barriers" act only on instructions that write to memory. Memory
barriers can be useful in both multiprocessor and uniprocessor
systems. The smp_xxx( ) primitives
are used whenever the memory barrier should prevent race conditions
that might occur only in multiprocessor systems; in uniprocessor
systems, they do nothing. The other memory barriers are used to
prevent race conditions occurring both in uniprocessor and
multiprocessor systems.
Table 5-6. Memory barriers in Linux
| Macro | Description |
|---|---|
| mb( ) | Memory barrier for MP and UP |
| rmb( ) | Read memory barrier for MP and UP |
| wmb( ) | Write memory barrier for MP and UP |
| smp_mb( ) | Memory barrier for MP only |
| smp_rmb( ) | Read memory barrier for MP only |
| smp_wmb( ) | Write memory barrier for MP only |
The implementations of the memory barrier primitives depend on
the architecture of the system. On an 80×86 microprocessor, the
rmb( ) macro usually expands into
asm volatile("lfence") if the CPU
supports the lfence assembly
language instruction, or into asm
volatile("lock;addl $0,0(%%esp)":::"memory") otherwise. The
asm statement inserts an assembly
language fragment in the code generated by the compiler and acts as an
optimization barrier. The lock; addl
$0,0(%%esp) assembly language instruction adds zero to the
memory location on top of the stack; the instruction is useless by
itself, but the lock prefix makes
the instruction a memory barrier for the CPU.
The wmb( ) macro is actually
simpler because it expands into barrier(
). This is because existing Intel microprocessors never
reorder write memory accesses, so there is no need to insert a
serializing assembly language instruction in the code. The macro,
however, forbids the compiler from shuffling the instructions.
Notice that in multiprocessor systems, all atomic operations
described in the earlier section "Atomic Operations" act as
memory barriers because they use the lock byte.
A widely used synchronization technique is locking. When a kernel control path must access a shared data structure or enter a critical region, it needs to acquire a "lock" for it. A resource protected by a locking mechanism is quite similar to a resource confined in a room whose door is locked when someone is inside. If a kernel control path wishes to access the resource, it tries to "open the door" by acquiring the lock. It succeeds only if the resource is free. Then, as long as it wants to use the resource, the door remains locked. When the kernel control path releases the lock, the door is unlocked and another kernel control path may enter the room.
Figure 5-1 illustrates the use of locks. Five kernel control paths (P0, P1, P2, P3, and P4) are trying to access two critical regions (C1 and C2). Kernel control path P0 is inside C1, while P2 and P4 are waiting to enter it. At the same time, P1 is inside C2, while P3 is waiting to enter it. Notice that P0 and P1 could run concurrently. The lock for critical region C3 is open because no kernel control path needs to enter it.
Spin locks are a special kind of lock designed to work in a multiprocessor environment. If the kernel control path finds the spin lock "open," it acquires the lock and continues its execution. Conversely, if the kernel control path finds the lock "closed" by a kernel control path running on another CPU, it "spins" around, repeatedly executing a tight instruction loop, until the lock is released.
The instruction loop of spin locks represents a "busy wait." The waiting kernel control path keeps running on the CPU, even if it has nothing to do besides waste time. Nevertheless, spin locks are usually convenient, because many kernel resources are locked for a fraction of a millisecond only; therefore, it would be far more time-consuming to release the CPU and reacquire it later.
As a general rule, kernel preemption is disabled in every critical region protected by spin locks. In the case of a uniprocessor system, the locks themselves are useless, and the spin lock primitives just disable or enable the kernel preemption. Please notice that kernel preemption is still enabled during the busy wait phase, thus a process waiting for a spin lock to be released could be replaced by a higher priority process.
In Linux, each spin lock is represented by a spinlock_t structure consisting of two
fields:
slock
Encodes the spin lock state: the value 1 corresponds to the unlocked state, while every negative value and 0 denote the locked state
break_lock
Flag signaling that a process is busy waiting for the lock (present only if the kernel supports both SMP and kernel preemption)
Six macros shown in Table 5-7 are used to initialize, test, and set spin locks. All these macros are based on atomic operations; this ensures that the spin lock will be updated properly even when multiple processes running on different CPUs try to modify the lock at the same time.[*]
Table 5-7. Spin lock macros
| Macro | Description |
|---|---|
| spin_lock_init( ) | Set the spin lock to 1 (unlocked) |
| spin_lock( ) | Cycle until spin lock becomes 1 (unlocked), then set it to 0 (locked) |
| spin_unlock( ) | Set the spin lock to 1 (unlocked) |
| spin_unlock_wait( ) | Wait until the spin lock becomes 1 (unlocked) |
| spin_is_locked( ) | Return 0 if the spin lock is set to 1 (unlocked); 1 otherwise |
| spin_trylock( ) | Set the spin lock to 0 (locked), and return 1 if the previous value of the lock was 1; 0 otherwise |
Let's discuss in detail the spin_lock macro, which is used to acquire
a spin lock. The following description refers to a preemptive kernel
for an SMP system. The macro takes the address slp of the spin lock as its parameter and
executes the following actions:
1. Invokes preempt_disable( ) to disable kernel preemption.
2. Invokes the _raw_spin_trylock( ) function, which does an atomic test-and-set operation on the spin lock's slock field; this function first executes some instructions equivalent to the following assembly language fragment:
movb $0, %al
xchgb %al, slp->slock

The xchg assembly language instruction atomically exchanges the content of the 8-bit %al register (storing zero) with the content of the memory location pointed to by slp->slock. The function then returns the value 1 if the old value stored in the spin lock (in %al after the xchg instruction) was positive, the value 0 otherwise.
3. If the old value of the spin lock was positive, the macro terminates: the kernel control path has acquired the spin lock.
4. Otherwise, the kernel control path failed in acquiring the
spin lock, thus the macro must cycle until the spin lock is
released by a kernel control path running on some other CPU.
Invokes preempt_enable( ) to
undo the increase of the preemption counter done in step 1. If
kernel preemption was enabled before executing the spin_lock macro, another process can
now replace this process while it waits for the spin
lock.
5. If the break_lock field
is equal to zero, sets it to one. By checking this field, the
process owning the lock and running on another CPU can learn
whether there are other processes waiting for the lock. If a
process holds a spin lock for a long time, it may decide to
release it prematurely to allow another process waiting for the
same spin lock to progress.
6. Executes the wait cycle:
while (spin_is_locked(slp) && slp->break_lock)
    cpu_relax();

The cpu_relax( ) macro
reduces to a pause assembly language instruction. This instruction
has been introduced in the Pentium 4 model to optimize the
execution of spin lock loops. By introducing a short delay, it
speeds up the execution of code following the lock and reduces
power consumption. The pause
instruction is backward compatible with earlier models of 80×86
microprocessors because it corresponds to the instruction
rep;nop, that is, to a
no-operation.
7. Jumps back to step 1 to try once more to get the spin lock.
If the kernel preemption option has not been selected when the kernel was compiled, the spin_lock macro is quite different from the one described above. In this case, the macro yields an assembly language fragment that is essentially equivalent to the following tight busy wait:[*]
1: lock; decb slp->slock
jns 3f
2: pause
cmpb $0,slp->slock
jle 2b
jmp 1b
3:
The decb assembly language
instruction decreases the spin lock value; the instruction is atomic
because it is prefixed by the lock byte. A test is then performed on the
sign flag. If it is clear, it means that the spin lock was set to 1
(unlocked), so normal execution continues at label 3 (the f suffix denotes the fact that the label
is a "forward" one; it appears in a later line of the program).
Otherwise, the tight loop at label 2 (the b suffix denotes a "backward" label) is
executed until the spin lock assumes a positive value. Then
execution restarts from label 1,
since it is unsafe to proceed without checking whether another
processor has grabbed the lock.
The spin_unlock
macro releases a previously acquired spin lock; it essentially
executes the assembly language instruction:
movb $1, slp->slock
and then invokes preempt_enable(
) (if kernel preemption is not supported, preempt_enable( ) does nothing). Notice
that the lock byte is not used
because write-only accesses in memory are always atomically executed
by the current 80×86 microprocessors.
Read/write spin locks have been introduced to increase the amount of concurrency inside the kernel. They allow several kernel control paths to simultaneously read the same data structure, as long as no kernel control path modifies it. If a kernel control path wishes to write to the structure, it must acquire the write version of the read/write lock, which grants exclusive access to the resource. Of course, allowing concurrent reads on data structures improves system performance.
Figure 5-2 illustrates two critical regions (C1 and C2) protected by read/write locks. Kernel control paths R0 and R1 are reading the data structures in C1 at the same time, while W0 is waiting to acquire the lock for writing. Kernel control path W1 is writing the data structures in C2, while both R2 and W2 are waiting to acquire the lock for reading and writing, respectively.
Each read/write spin lock is a rwlock_t structure; its lock field is a 32-bit field that encodes
two distinct pieces of information:
A 24-bit counter denoting the number of kernel control paths currently reading the protected data structure. The two's complement value of this counter is stored in bits 0–23 of the field.
An unlock flag that is set when no kernel control path is reading or writing, and clear otherwise. This unlock flag is stored in bit 24 of the field.
Notice that the lock field
stores the number 0x01000000 if the
spin lock is idle (unlock flag set and no readers), the number
0x00000000 if it has been acquired
for writing (unlock flag clear and no readers), and any number in the
sequence 0x00ffffff, 0x00fffffe, and so on, if it has been
acquired for reading by one, two, or more processes (unlock flag clear
and the two's complement on 24 bits of the number of readers). As in the
spinlock_t structure, the rwlock_t structure also includes a break_lock field.
The rwlock_init macro
initializes the lock field of a
read/write spin lock to 0x01000000
(unlocked) and the break_lock field
to zero.
The read_lock
macro, applied to the address rwlp of a read/write spin lock, is similar
to the spin_lock macro described
in the previous section. If the kernel preemption option has been
selected when the kernel was compiled, the macro performs the very
same actions as those of spin_lock(
), with just one exception: to effectively acquire the
read/write spin lock in step 2, the macro executes the _raw_read_trylock( ) function:
int _raw_read_trylock(rwlock_t *lock)
{
atomic_t *count = (atomic_t *)lock->lock;
atomic_dec(count);
if (atomic_read(count) >= 0)
return 1;
atomic_inc(count);
return 0;
}
The lock field—the
read/write lock counter—is accessed by means of atomic operations.
Notice, however, that the whole function does not act atomically on
the counter: for instance, the counter might change after having
tested its value with the if
statement and before returning 1. Nevertheless, the function works
properly: in fact, the function returns 1 only if the counter was
not zero or negative before the decrement, because the counter is
equal to 0x01000000 for no owner,
0x00ffffff for one reader, and
0x00000000 for one writer.
If the kernel preemption option has not been selected when the
kernel was compiled, the read_lock macro yields the following
assembly language code:
movl $rwlp->lock,%eax
lock; subl $1,(%eax)
jns 1f
call _ _read_lock_failed
1:

where __read_lock_failed( ) is the following assembly language function:
__read_lock_failed:
lock; incl (%eax)
1: pause
cmpl $1,(%eax)
js 1b
lock; decl (%eax)
js __read_lock_failed
ret
The read_lock macro
atomically decreases the spin lock value by 1, thus increasing the
number of readers. The spin lock is acquired if the decrement
operation yields a nonnegative value; otherwise, the __read_lock_failed( ) function is
invoked. The function atomically increases the lock field to undo the decrement operation
performed by the read_lock macro,
and then loops until the field becomes positive (greater than or
equal to 1). Next, _ _read_lock_failed(
) tries to get the spin lock again (another kernel control
path could acquire the spin lock for writing right after the
cmpl instruction).
Releasing the read lock is quite simple, because the read_unlock macro must simply increase the
counter in the lock field with
the assembly language instruction:
lock; incl rwlp->lock
to decrease the number of readers, and then invoke preempt_enable( ) to reenable kernel
preemption.
The write_lock
macro is implemented in the same way as spin_lock( ) and read_lock( ). For instance, if kernel
preemption is supported, the function disables kernel preemption and
tries to grab the lock right away by invoking _raw_write_trylock( ). If this function
returns 0, the lock was already taken, thus the macro reenables
kernel preemption and starts a busy wait loop, as explained in the
description of spin_lock( ) in
the previous section.
The _raw_write_trylock( )
function is shown below:
int _raw_write_trylock(rwlock_t *lock)
{
atomic_t *count = (atomic_t *)lock->lock;
if (atomic_sub_and_test(0x01000000, count))
return 1;
atomic_add(0x01000000, count);
return 0;
}
The _raw_write_trylock( )
function subtracts 0x01000000
from the read/write spin lock value, thus clearing the unlock flag
(bit 24). If the subtraction operation yields zero (no readers), the
lock is acquired and the function returns 1; otherwise, the function
atomically adds 0x01000000 to the
spin lock value to undo the subtraction operation.
Once again, releasing the write lock is much simpler because
the write_unlock macro must
simply set the unlock flag in the lock field with the assembly language
instruction:
lock; addl $0x01000000,rwlp
and then invoke preempt_enable().
When using read/write spin locks, requests issued by
kernel control paths to perform a read_lock or a write_lock operation have the same priority:
readers must wait until the writer has finished and, similarly, a
writer must wait until all readers have finished.
Seqlocks introduced in Linux 2.6 are similar to read/write spin locks, except that they give a much higher priority to writers: in fact a writer is allowed to proceed even when readers are active. The good part of this strategy is that a writer never waits (unless another writer is active); the bad part is that a reader may sometimes be forced to read the same data several times until it gets a valid copy.
Each seqlock is a seqlock_t
structure consisting of two fields: a lock field of type spinlock_t and an integer sequence field. This second field plays the
role of a sequence counter. Each reader must read this sequence
counter twice, before and after reading the data, and check whether
the two values coincide. In the opposite case, a new writer has become
active and has increased the sequence counter, thus implicitly telling
the reader that the data just read is not valid.
A seqlock_t variable is
initialized to "unlocked" either by assigning to it the value SEQLOCK_UNLOCKED, or by executing the
seqlock_init macro. Writers acquire
and release a seqlock by invoking write_seqlock( ) and write_sequnlock( ). The first function
acquires the spin lock in the seqlock_t data structure, then increases the
sequence counter by one; the second function increases the sequence
counter once more, then releases the spin lock. This ensures that when
the writer is in the middle of writing, the counter is odd, and that
when no writer is altering data, the counter is even. Readers
implement a critical region as follows:
unsigned int seq;
do {
seq = read_seqbegin(&seqlock);
/* ... CRITICAL REGION ... */
} while (read_seqretry(&seqlock, seq));

read_seqbegin( ) returns the
read_seqbegin() returns the
current sequence number of the seqlock; read_seqretry() returns 1 if either the
value of the seq local variable is
odd (a writer was updating the data structure when the read_seqbegin( ) function has been invoked),
or if the value of seq does not
match the current value of the seqlock's sequence counter (a writer
started working while the reader was still executing the code in the
critical region).
Notice that when a reader enters a critical region, it does not need to disable kernel preemption; on the other hand, the writer automatically disables kernel preemption when entering the critical region, because it acquires the spin lock.
Not every kind of data structure can be protected by a seqlock. As a general rule, the following conditions must hold:
The data structure to be protected does not include pointers that are modified by the writers and dereferenced by the readers (otherwise, a writer could change the pointer under the nose of the readers)
The code in the critical regions of the readers does not have side effects (otherwise, multiple reads would have different effects from a single read)
Furthermore, the critical regions of the readers should be short and writers should seldom acquire the seqlock, otherwise repeated read accesses would cause a severe overhead. A typical usage of seqlocks in Linux 2.6 consists of protecting some data structures related to the system time handling (see Chapter 6).
Read-copy update (RCU) is yet another synchronization technique designed to protect data structures that are mostly accessed for reading by several CPUs. RCU allows many readers and many writers to proceed concurrently (an improvement over seqlocks, which allow only one writer to proceed). Moreover, RCU is lock-free, that is, it uses no lock or counter shared by all CPUs; this is a great advantage over read/write spin locks and seqlocks, which have a high overhead due to cache line-snooping and invalidation.
How does RCU obtain the surprising result of synchronizing several CPUs without shared data structures? The key idea consists of limiting the scope of RCU as follows:
Only data structures that are dynamically allocated and referenced by means of pointers can be protected by RCU.
No kernel control path can sleep inside a critical region protected by RCU.
When a kernel control path wants to read an RCU-protected data
structure, it executes the rcu_read_lock(
) macro, which is equivalent to preempt_disable( ). Next, the reader
dereferences the pointer to the data structure and starts reading it.
As stated above, the reader cannot sleep until it finishes reading the
data structure; the end of the critical region is marked by the
rcu_read_unlock( ) macro, which is
equivalent to preempt_enable(
).
Because the reader does very little to prevent race conditions, we could expect that the writer has to work a bit more. In fact, when a writer wants to update the data structure, it dereferences the pointer and makes a copy of the whole data structure. Next, the writer modifies the copy. Once finished, the writer changes the pointer to the data structure so as to make it point to the updated copy. Because changing the value of the pointer is an atomic operation, each reader or writer sees either the old copy or the new one: no corruption in the data structure may occur. However, a memory barrier is required to ensure that the updated pointer is seen by the other CPUs only after the data structure has been modified. Such a memory barrier is implicitly introduced if a spin lock is coupled with RCU to forbid the concurrent execution of writers.
The real problem with the RCU technique, however, is that the
old copy of the data structure cannot be freed right away when the
writer updates the pointer. In fact, the readers that were accessing
the data structure when the writer started its update could still be
reading the old copy. The old copy can be freed only after all
(potential) readers on the CPUs have executed the rcu_read_unlock( ) macro. The kernel
requires every potential reader to execute that macro before:
The CPU performs a process switch (see restriction 2 earlier).
The CPU starts executing in User Mode.
The CPU executes the idle loop (see the section "Kernel Threads" in Chapter 3).
In each of these cases, we say that the CPU has gone through a quiescent state.
The call_rcu( ) function is
invoked by the writer to get rid of the old copy of the data
structure. It receives as its parameters the address of an rcu_head descriptor (usually embedded inside
the data structure to be freed) and the address of a
callback function to be invoked when all CPUs
have gone through a quiescent state. Once executed, the callback
function usually frees the old copy of the data structure.
The call_rcu( ) function
stores in the rcu_head descriptor
the address of the callback and its parameter, then inserts the
descriptor in a per-CPU list of callbacks. Periodically, once every
tick (see the section "Updating Local CPU
Statistics" in Chapter
6), the kernel checks whether the local CPU has gone through a
quiescent state. When all CPUs have gone through a quiescent state, a
local tasklet—whose descriptor is stored in the rcu_tasklet per-CPU variable—executes all
callbacks in the list.
RCU is a new addition in Linux 2.6; it is used in the networking layer and in the Virtual Filesystem.
We have already introduced semaphores in the section "Synchronization and Critical Regions" in Chapter 1. Essentially, they implement a locking primitive that allows waiters to sleep until the desired resource becomes free.
Actually, Linux offers two kinds of semaphores:
Kernel semaphores, which are used by kernel control paths
System V IPC semaphores, which are used by User Mode processes
In this section, we focus on kernel semaphores, while IPC semaphores are described in Chapter 19.
A kernel semaphore is similar to a spin lock, in that it doesn't allow a kernel control path to proceed unless the lock is open. However, whenever a kernel control path tries to acquire a busy resource protected by a kernel semaphore, the corresponding process is suspended. It becomes runnable again when the resource is released. Therefore, kernel semaphores can be acquired only by functions that are allowed to sleep; interrupt handlers and deferrable functions cannot use them.
A kernel semaphore is an object of type struct semaphore, containing the fields
shown in the following list.
count
Stores an atomic_t
value. If it is greater than 0, the resource is free — that is,
it is currently available. If count is equal to 0, the semaphore is
busy but no other process is waiting for the protected resource.
Finally, if count is
negative, the resource is unavailable and at least one process
is waiting for it.
wait
Stores the address of a wait queue list that includes all
sleeping processes that are currently waiting for the resource.
Of course, if count is
greater than or equal to 0, the wait queue is empty.
sleepers
Stores a flag that indicates whether some processes are sleeping on the semaphore. We'll see this field in operation soon.
The init_MUTEX( ) and
init_MUTEX_LOCKED( ) functions may
be used to initialize a semaphore for exclusive access: they set the
count field to 1 (free resource
with exclusive access) and 0 (busy resource with exclusive access
currently granted to the process that initializes the semaphore),
respectively. The DECLARE_MUTEX and
DECLARE_MUTEX_LOCKED macros do the
same, but they also statically allocate the struct semaphore variable. Note that a
semaphore could also be initialized with an arbitrary positive value
n for count.
In this case, at most n processes are allowed to
concurrently access the resource.
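The meaning of the count field can be illustrated with a small single-threaded sketch. The model below is not the kernel implementation; sem_sim, down_sim, and up_sim are invented names standing in for the real primitives, and the sleepers_woken counter merely records when __up( ) would have woken a sleeper:

```c
#include <assert.h>

/* Toy single-threaded model of the kernel semaphore count convention
 * (names invented): count > 0 means free, 0 means busy with no
 * waiters, negative means busy with |count| processes waiting. */

struct sem_sim { int count; int sleepers_woken; };

static void sema_init_sim(struct sem_sim *s, int n) { s->count = n; }

/* down(): decrement; a negative result means the caller must sleep. */
static int down_sim(struct sem_sim *s)
{
    return --s->count >= 0;     /* 1 = acquired, 0 = would sleep */
}

/* up(): increment; a result <= 0 means someone was waiting. */
static void up_sim(struct sem_sim *s)
{
    if (++s->count <= 0)
        s->sleepers_woken++;    /* __up() would wake one sleeper */
}
```

Initializing with n greater than 1 models a counting semaphore allowing n concurrent holders, as noted above.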
Let's start by discussing how to release a semaphore,
which is much simpler than getting one. When a process wishes to
release a kernel semaphore lock, it invokes the up( ) function. This function is
essentially equivalent to the following assembly language
fragment:
movl $sem->count,%ecx
lock; incl (%ecx)
jg 1f
lea %ecx,%eax
pushl %edx
pushl %ecx
call _ _up
popl %ecx
popl %edx
1:
where _ _up( ) is the
following C function:
__attribute__((regparm(3))) void _ _up(struct semaphore *sem)
{
wake_up(&sem->wait);
}
The up( ) function
increases the count field of the
*sem semaphore, and then it
checks whether its value is greater than 0. The increment of
count and the setting of the flag
tested by the following jump instruction must be atomically
executed, or else another kernel control path could concurrently
access the field value, with disastrous results. If count is greater than 0, there was no
process sleeping in the wait queue, so nothing has to be done.
Otherwise, the _ _up( ) function
is invoked so that one sleeping process is woken up. Notice that
_ _up( ) receives its parameter
from the eax register (see the
description of the _ _switch_to(
) function in the section "Performing the Process
Switch" in Chapter
3).
Conversely, when a process wishes to acquire a kernel
semaphore lock, it invokes the down(
) function. The implementation of down( ) is quite involved, but it is
essentially equivalent to the following:
down:
movl $sem->count,%ecx
lock; decl (%ecx);
jns 1f
lea %ecx, %eax
pushl %edx
pushl %ecx
call _ _down
popl %ecx
popl %edx
1:
where _ _down( ) is the
following C function:
__attribute__((regparm(3))) void _ _down(struct semaphore * sem)
{
DECLARE_WAITQUEUE(wait, current);
unsigned long flags;
current->state = TASK_UNINTERRUPTIBLE;
spin_lock_irqsave(&sem->wait.lock, flags);
add_wait_queue_exclusive_locked(&sem->wait, &wait);
sem->sleepers++;
for (;;) {
if (!atomic_add_negative(sem->sleepers-1, &sem->count)) {
sem->sleepers = 0;
break;
}
sem->sleepers = 1;
spin_unlock_irqrestore(&sem->wait.lock, flags);
schedule( );
spin_lock_irqsave(&sem->wait.lock, flags);
current->state = TASK_UNINTERRUPTIBLE;
}
remove_wait_queue_locked(&sem->wait, &wait);
wake_up_locked(&sem->wait);
spin_unlock_irqrestore(&sem->wait.lock, flags);
current->state = TASK_RUNNING;
}
The down( ) function
decreases the count field of the
*sem semaphore, and then checks
whether its value is negative. Again, the decrement and the test
must be atomically executed. If count is greater than or equal to 0, the
current process acquires the resource and the execution continues
normally. Otherwise, count is
negative, and the current process must be suspended. The contents of
some registers are saved on the stack, and then _ _down( ) is invoked.
Essentially, the _ _down( )
function changes the state of the current process from TASK_RUNNING to TASK_UNINTERRUPTIBLE, and it puts the
process in the semaphore wait queue. Before accessing the fields of
the semaphore structure, the
function also gets the sem->wait.lock spin lock that protects
the semaphore wait queue (see "How Processes Are
Organized" in Chapter
3) and disables local interrupts. Usually, wait queue
functions get and release the wait queue spin lock as necessary when
inserting and deleting an element. The _
_down( ) function, however, uses the wait queue spin lock
also to protect the other fields of the semaphore data structure, so that no
process running on another CPU is able to read or modify them. To
that end, _ _down( ) uses the
"_locked" versions of the wait
queue functions, which assume that the spin lock has been already
acquired before their invocations.
The main task of the _ _down(
) function is to suspend the current process until the
semaphore is released. However, the way in which this is done is
quite involved. To easily understand the code, keep in mind that the
sleepers field of the semaphore
is usually set to 0 if no process is sleeping in the wait queue of
the semaphore, and it is set to 1 otherwise. Let's try to explain
the code by considering a few typical cases.
Open semaphore (count equal to 1, sleepers equal to 0)
The down macro just
sets the count field to 0
and jumps to the next instruction of the main program;
therefore, the _ _down( )
function is not executed at all.
Closed semaphore, no sleeping processes (count equal to 0, sleepers equal to 0)
The down macro
decreases count and invokes
the _ _down( ) function
with the count field set to
-1 and the sleepers field
set to 0. In each iteration of the loop, the function checks
whether the count field is
negative. (Observe that the count field is not changed by
atomic_add_negative( )
because sleepers is equal
to 0 when the function is invoked.)
If the count
field is negative, the function invokes schedule( ) to suspend the
current process. The count field is still set to -1,
and the sleepers field
to 1. The process subsequently resumes execution inside
this loop and issues the test again.
If the count
field is not negative, the function sets sleepers to 0 and exits from the
loop. It tries to wake up another process in the semaphore
wait queue (but in our scenario, the queue is now empty)
and terminates holding the semaphore. On exit, both the
count field and the
sleepers field are set
to 0, as required when the semaphore is closed but no
process is waiting for it.
Closed semaphore, other sleeping processes (count equal to -1, sleepers equal to 1)
The down macro
decreases count and invokes
the _ _down( ) function
with count set to -2 and
sleepers set to 1. The
function temporarily sets sleepers to 2, and then undoes the
decrement performed by the down macro by adding the value
sleepers-1 to count. At the same time, the
function checks whether count is still negative (the
semaphore could have been released by the holding process
right before _ _down( )
entered the critical region).
If the count
field is negative, the function resets sleepers to 1 and invokes
schedule( ) to suspend
the current process. The count field is still set to -1,
and the sleepers field
to 1.
If the count
field is not negative, the function sets sleepers to 0, tries to wake up
another process in the semaphore wait queue, and exits
holding the semaphore. On exit, the count field is set to 0 and the
sleepers field to 0.
The values of both fields look wrong, because there are
other sleeping processes. However, consider that another
process in the wait queue has been woken up. This process
does another iteration of the loop; the atomic_add_negative( ) function
subtracts 1 from count,
restoring it to -1; moreover, before returning to sleep,
the woken-up process resets sleepers to 1.
So, the code properly works in all cases. Consider that the
wake_up( ) function in _ _down( ) wakes up at most one process,
because the sleeping processes in the wait queue are exclusive (see
the section "How
Processes Are Organized" in Chapter 3).
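The case analysis above can be replayed mechanically. The sketch below is a single-threaded user-space walk-through, not the kernel code; add_negative_sim, sem_state, and down_loop_pass are invented stand-ins for atomic_add_negative( ) and one pass of the __down( ) loop body:

```c
#include <assert.h>

/* Single-threaded walk-through of the sleepers/count algebra in
 * __down() (all names invented for the sketch). add_negative_sim()
 * adds i to *v and reports whether the result is negative, like
 * atomic_add_negative(). */

static int add_negative_sim(int i, int *v) { *v += i; return *v < 0; }

struct sem_state { int count; int sleepers; };

/* One pass of the __down() loop body; returns 1 if the process
 * must (go back to) sleep, 0 if it acquired the semaphore. */
static int down_loop_pass(struct sem_state *s)
{
    if (!add_negative_sim(s->sleepers - 1, &s->count)) {
        s->sleepers = 0;        /* got the semaphore */
        return 0;
    }
    s->sleepers = 1;            /* still busy: schedule() again */
    return 1;
}
```

Running the closed-semaphore scenario (count 0, sleepers 0) through this model reproduces exactly the state transitions described in the second case above.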
Only exception handlers , and particularly system call service
routines , can use the down(
) function. Interrupt handlers or deferrable functions
must not invoke down( ), because
this function suspends the process when the semaphore is busy. For
this reason, Linux provides the down_trylock( ) function, which may be
safely used by one of the previously mentioned asynchronous
functions. It is identical to down(
) except when the resource is busy. In this case, the
function returns immediately instead of putting the process to
sleep.
A slightly different function called down_interruptible( ) is also defined. It
is widely used by device drivers, because it allows processes that
receive a signal while being blocked on a semaphore to give up the
"down" operation. If the sleeping process is woken up by a signal
before getting the needed resource, the function increases the
count field of the semaphore and
returns the value -EINTR. On the
other hand, if down_interruptible(
) runs to normal completion and gets the resource, it
returns 0. The device driver may thus abort the I/O operation when
the return value is -EINTR.
Finally, because processes usually find semaphores in an open
state, the semaphore functions are optimized for this case. In
particular, the up( ) function
does not execute jump instructions if the semaphore wait queue is
empty; similarly, the down( )
function does not execute jump instructions if the semaphore is
open. Much of the complexity of the semaphore implementation is
precisely due to the effort of avoiding costly instructions in the
main branch of the execution flow.
Read/write semaphores are similar to the read/write spin locks described earlier in the section "Read/Write Spin Locks," except that waiting processes are suspended instead of spinning until the semaphore becomes open again.
Many kernel control paths may concurrently acquire a read/write semaphore for reading; however, every writer kernel control path must have exclusive access to the protected resource. Therefore, the semaphore can be acquired for writing only if no other kernel control path is holding it for either read or write access. Read/write semaphores improve the amount of concurrency inside the kernel and improve overall system performance.
The kernel handles all processes waiting for a read/write semaphore in strict FIFO order. Each reader or writer that finds the semaphore closed is inserted in the last position of the semaphore's wait queue list. When the semaphore is released, the process in the first position of the wait queue list is checked. The first process is always awoken. If it is a writer, the other processes in the wait queue continue to sleep. If it is a reader, all readers at the start of the queue, up to the first writer, are also woken up and get the lock. However, readers that have been queued after a writer continue to sleep.
Each read/write semaphore is described by a rw_semaphore structure that includes the
following fields:
count
Stores two 16-bit counters. The counter in the most significant word encodes in two's complement form the sum of the number of nonwaiting writers (either 0 or 1) and the number of waiting kernel control paths. The counter in the less significant word encodes the total number of nonwaiting readers and writers.
wait_list
Points to a list of waiting processes. Each element in
this list is a rwsem_waiter
structure, including a pointer to the descriptor of the sleeping
process and a flag indicating whether the process wants the
semaphore for reading or for writing.
wait_lock
A spin lock used to protect the wait queue list and the
rw_semaphore structure
itself.
The init_rwsem( ) function
initializes an rw_semaphore
structure by setting the count
field to 0, the wait_lock spin lock
to unlocked, and wait_list to the
empty list.
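The two-counter layout of the count field can be made concrete with the bias values used by the i386 implementation, which we quote here as an assumption (ACTIVE_BIAS, WAITING_BIAS, and the helper functions are named for this sketch only; the extraction helpers assume a two's-complement machine): a reader adds 1, a writer adds 0xffff0001 (that is, -0x10000 in the high word plus 1 in the low word), and each sleeping waiter contributes -0x10000:

```c
#include <assert.h>
#include <stdint.h>

/* Illustration of the rw_semaphore count layout, using the bias
 * values of the i386 implementation (assumed here, names invented). */

#define ACTIVE_BIAS        0x00000001
#define WAITING_BIAS       ((int32_t)0xffff0000)   /* -0x10000 */
#define ACTIVE_WRITE_BIAS  (WAITING_BIAS + ACTIVE_BIAS)

/* High word (two's complement): minus the sum of nonwaiting writers
 * and waiting paths. Low word: total nonwaiting readers and writers. */
static int16_t high_word(int32_t c) { return (int16_t)((uint32_t)c >> 16); }
static int16_t low_word(int32_t c)  { return (int16_t)((uint32_t)c & 0xffff); }
```

Extracting the two halves for a few lock states shows how the single atomic count encodes both pieces of information at once.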
The down_read( ) and down_write( ) functions acquire the
read/write semaphore for reading and writing, respectively. Similarly,
the up_read( ) and up_write( ) functions release a read/write
semaphore previously acquired for reading and for writing. The
down_read_trylock( ) and down_write_trylock( ) functions are similar
to down_read( ) and down_write( ), respectively, but they do not
block the process if the semaphore is busy. Finally, the downgrade_write( ) function atomically
transforms a write lock into a read lock. The implementation of these
five functions is long, but easy to follow because it resembles the
implementation of normal semaphores; therefore, we avoid describing
them.
Linux 2.6 also makes use of another synchronization
primitive similar to semaphores: completions
. They have been introduced to solve a subtle race
condition that occurs in multiprocessor systems when process A
allocates a temporary semaphore variable, initializes it as closed
MUTEX, passes its address to process B, and then invokes down( ) on it. Process A plans to destroy
the semaphore as soon as it awakens. Later on, process B running on a
different CPU invokes up( ) on the
semaphore. However, in the current implementation up( ) and down(
) can execute concurrently on the same semaphore. Thus,
process A can be woken up and destroy the temporary semaphore while
process B is still executing the up(
) function. As a result, up(
) might attempt to access a data structure that no longer
exists.
Of course, it is possible to change the implementation of
down( ) and up( ) to forbid concurrent executions on the
same semaphore. However, this change would require additional
instructions, which is a bad thing to do for functions that are so
heavily used.
The completion is a synchronization primitive that is
specifically designed to solve this problem. The completion data structure includes a wait
queue head and a flag:
struct completion {
unsigned int done;
wait_queue_head_t wait;
};
The function corresponding to up(
) is called complete( ).
It receives as an argument the address of a completion data structure, invokes spin_lock_irqsave( ) on the spin lock of the
completion's wait queue, increases the done field, wakes up the exclusive process
sleeping in the wait wait queue,
and finally invokes spin_unlock_irqrestore(
).
The function corresponding to down(
) is called wait_for_completion(
). It receives as an argument the address of a completion data structure and checks the
value of the done flag. If it is
greater than zero, wait_for_completion(
) terminates, because complete(
) has already been executed on another CPU. Otherwise, the
function adds current to the tail
of the wait queue as an exclusive process and puts current to sleep in the TASK_UNINTERRUPTIBLE state. Once woken up,
the function removes current from
the wait queue. Then, the function checks the value of the done flag: if it is different from zero, the
function terminates; otherwise, the current process is suspended
again. As in the case of the complete(
) function, wait_for_completion(
) makes use of the spin lock in the completion's wait
queue.
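The complete( )/wait_for_completion( ) protocol can be sketched single-threaded. The model below is only an illustration (completion_sim, complete_sim, and wait_for_completion_once are invented names; the sleeping flag stands in for the process being in TASK_UNINTERRUPTIBLE on the wait queue, and the real locking is omitted):

```c
#include <assert.h>

/* Single-threaded sketch of the completion protocol described above
 * (names invented): complete() increments done under the wait-queue
 * lock; wait_for_completion() returns at once if done is positive,
 * consuming one "done" token, and sleeps otherwise. */

struct completion_sim { unsigned int done; int sleeping; };

static void complete_sim(struct completion_sim *c)
{
    c->done++;              /* under the wait-queue spin lock */
    if (c->sleeping)
        c->sleeping = 0;    /* wake the exclusive sleeper */
}

/* Returns 1 if the caller could proceed, 0 if it had to sleep. */
static int wait_for_completion_once(struct completion_sim *c)
{
    if (c->done > 0) {
        c->done--;
        return 1;
    }
    c->sleeping = 1;        /* TASK_UNINTERRUPTIBLE, on the wait queue */
    return 0;
}
```

Note that because done is a counter rather than a boolean, a completion posted before the waiter arrives is not lost, which is central to the race described above.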
The real difference between completions and semaphores is how
the spin lock included in the wait queue is used. In completions, the
spin lock is used to ensure that complete(
) and wait_for_completion(
) cannot execute concurrently. In semaphores, the spin lock
is used to avoid letting concurrent invocations of down(
) mess up the semaphore data structure.
Interrupt disabling is one of the key mechanisms used to ensure that a sequence of kernel statements is treated as a critical section. It allows a kernel control path to continue executing even when hardware devices issue IRQ signals, thus providing an effective way to protect data structures that are also accessed by interrupt handlers. By itself, however, local interrupt disabling does not protect against concurrent accesses to data structures by interrupt handlers running on other CPUs, so in multiprocessor systems, local interrupt disabling is often coupled with spin locks (see the later section "Synchronizing Accesses to Kernel Data Structures").
The local_irq_disable( )
macro, which makes use of the cli
assembly language instruction, disables interrupts on
the local CPU. The local_irq_enable(
) macro, which makes use of the sti assembly language instruction, enables them. As stated
in the section "IRQs and
Interrupts" in Chapter
4, the cli and sti assembly language instructions,
respectively, clear and set the IF
flag of the eflags control register. The irqs_disabled( ) macro yields the value one
if the IF flag of the eflags register is clear, the value zero if
the flag is set.
When the kernel enters a critical section, it disables
interrupts by clearing the IF flag
of the eflags register. But at the
end of the critical section, often the kernel can't simply set the
flag again. Interrupts can execute in nested fashion, so the kernel
does not necessarily know what the IF flag was before the current control path
executed. In these cases, the control path must save the old setting
of the flag and restore that setting at the end.
Saving and restoring the eflags content is achieved by means of the
local_irq_save and local_irq_restore macros, respectively. The
local_irq_save macro copies the
content of the eflags register into
a local variable; the IF flag is
then cleared by a cli assembly
language instruction. At the end of the critical region, the macro
local_irq_restore restores the
original content of eflags;
therefore, interrupts are enabled only if they were enabled before
this control path issued the cli
assembly language instruction.
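The save/restore pattern, and why it nests correctly, can be shown with a toy model in which if_flag is a simulated IF bit (all names are invented; this merely mimics the macros' behavior, not real interrupt hardware):

```c
#include <assert.h>

/* Toy model of the local_irq_save/local_irq_restore pattern:
 * if_flag is a simulated interrupt flag, and each critical region
 * saves the old value instead of unconditionally re-enabling. */

static int if_flag = 1;                 /* interrupts initially enabled */

static unsigned long local_irq_save_sim(void)
{
    unsigned long flags = if_flag;      /* copy of eflags */
    if_flag = 0;                        /* cli */
    return flags;
}

static void local_irq_restore_sim(unsigned long flags)
{
    if_flag = (int)flags;               /* restore previous IF state */
}
```

Because each region restores what it saved, an inner critical region never re-enables interrupts while an outer one is still active.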
In the section "Softirqs" in Chapter 4, we explained that deferrable functions can be executed at unpredictable times (essentially, on termination of hardware interrupt handlers). Therefore, data structures accessed by deferrable functions must be protected against race conditions.
A trivial way to forbid deferrable functions execution on a CPU is to disable interrupts on that CPU. Because no interrupt handler can be activated, softirq actions cannot be started asynchronously.
As we'll see in the next section, however, the kernel sometimes
needs to disable deferrable functions without disabling interrupts. Local deferrable functions can be enabled
or disabled on the local CPU by acting on the softirq counter stored
in the preempt_count field of the
current's thread_info descriptor.
Recall that the do_softirq( )
function never executes the softirqs if the softirq counter is
positive. Moreover, tasklets are implemented on top of softirqs, so
setting this counter to a positive value disables the execution of all
deferrable functions on a given CPU, not just softirqs.
The local_bh_disable macro
adds one to the softirq counter of the local CPU, while the local_bh_enable( ) function subtracts one
from it. The kernel can thus use several nested invocations of
local_bh_disable; deferrable
functions will be enabled again only by the local_bh_enable macro matching the first
local_bh_disable invocation.
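The nesting discipline of the softirq counter can be sketched as follows. This is only a model of the counting behavior (softirq_count stands in for the softirq part of preempt_count, and softirqs_run records when do_softirq( ) would have been allowed to run; the names are invented):

```c
#include <assert.h>

/* Sketch of the softirq-counter discipline: deferrable functions
 * run again only when the nesting counter drops back to zero. */

static int softirq_count;       /* softirq part of preempt_count */
static int softirqs_run;

static void local_bh_disable_sim(void) { softirq_count++; }

static void local_bh_enable_sim(int pending)
{
    if (--softirq_count == 0 && pending)
        softirqs_run++;         /* do_softirq() would run them now */
}
```

As the model shows, only the enable that matches the first disable lets pending softirqs execute.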
After having decreased the softirq counter, local_bh_enable( ) performs two important
operations that help to ensure timely execution of long-waiting
threads:
Checks the hardirq counter and the softirq counter in the
preempt_count field of the
local CPU; if both of them are zero and there are pending softirqs
to be executed, invokes do_softirq(
) to activate them (see the section "Softirqs" in Chapter 4).
Checks whether the TIF_NEED_RESCHED flag of the local CPU
is set; if so, a process switch request is pending, thus invokes
the preempt_schedule( )
function (see the section "Kernel Preemption"
earlier in this chapter).
[*] A data item is aligned in memory when its address is a multiple of its size in bytes. For instance, the address of an aligned short integer must be a multiple of two, while the address of an aligned integer must be a multiple of four. Generally speaking, an unaligned memory access is not atomic.
[*] Spin locks, ironically enough, are global and therefore must themselves be protected against concurrent accesses.
[*] The actual implementation of the tight busy wait loop is slightly more complicated. The code at label 2, which is executed only if the spin lock is busy, is included in an auxiliary section so that in the most frequent case (when the spin lock is already free) the hardware cache is not filled with code that won't be executed. In our discussion, we omit these optimization details.
A shared data structure can be protected against race conditions by using some of the synchronization primitives shown in the previous section. Of course, system performance may vary considerably, depending on the kind of synchronization primitive selected. Usually, the following rule of thumb is adopted by kernel developers: always keep the concurrency level as high as possible in the system.
In turn, the concurrency level in the system depends on two main factors:
The number of I/O devices that operate concurrently
The number of CPUs that do productive work
To maximize I/O throughput, interrupts should be disabled for very short periods of time. As described in the section "IRQs and Interrupts" in Chapter 4, when interrupts are disabled, IRQs issued by I/O devices are temporarily ignored by the PIC, and no new activity can start on such devices.
To use CPUs efficiently, synchronization primitives based on spin locks should be avoided whenever possible. When a CPU is executing a tight instruction loop waiting for the spin lock to open, it is wasting precious machine cycles. Even worse, as we have already said, spin locks have negative effects on the overall performance of the system because of their impact on the hardware caches.
Let's illustrate a couple of cases in which synchronization can be achieved while still maintaining a high concurrency level:
A shared data structure consisting of a single integer value
can be updated by declaring it as an atomic_t type and by using atomic
operations. An atomic operation is faster than spin locks and
interrupt disabling, and it slows down only kernel control paths
that concurrently access the data structure.
Inserting an element into a shared linked list is never atomic, because it consists of at least two pointer assignments. Nevertheless, the kernel can sometimes perform this insertion operation without using locks or disabling interrupts. As an example of why this works, we'll consider the case where a system call service routine (see "System Call Handler and Service Routines" in Chapter 10) inserts new elements in a singly linked list, while an interrupt handler or deferrable function asynchronously looks up the list.
In the C language, insertion is implemented by means of the following pointer assignments:
new->next = list_element->next;
list_element->next = new;
In assembly language, insertion reduces to two consecutive
atomic instructions. The first instruction sets up the next pointer of the new element, but it does not modify the
list. Thus, if the interrupt handler sees the list between the
execution of the first and second instructions, it sees the list
without the new element. If the handler sees the list after the
execution of the second instruction, it sees the list with the new
element. The important point is that in either case, the list is
consistent and in an uncorrupted state. However, this integrity is
assured only if the interrupt handler does not modify the list. If
it does, the next pointer that
was just set within the new
element might become invalid.
However, developers must ensure that the order of the two assignment operations cannot be subverted by the compiler or the CPU's control unit; otherwise, if the system call service routine is interrupted by the interrupt handler between the two assignments, the handler finds a corrupted list. Therefore, a write memory barrier primitive is required:
new->next = list_element->next;
wmb( );
list_element->next = new;
Unfortunately, access patterns to most kernel data structures are a lot more complex than the simple examples just shown, and kernel developers are forced to use semaphores, spin locks, and interrupt and softirq disabling. Generally speaking, choosing the synchronization primitives depends on what kinds of kernel control paths access the data structure, as shown in Table 5-8. Remember that whenever a kernel control path acquires a spin lock (as well as a read/write lock, a seqlock, or an RCU "read lock"), disables the local interrupts, or disables the local softirqs, kernel preemption is automatically disabled.
Table 5-8. Protection required by data structures accessed by kernel control paths
Kernel control paths accessing the data structure | UP protection | MP further protection |
|---|---|---|
Exceptions | Semaphore | None |
Interrupts | Local interrupt disabling | Spin lock |
Deferrable functions | None | None or spin lock (see Table 5-9) |
Exceptions + Interrupts | Local interrupt disabling | Spin lock |
Exceptions + Deferrable functions | Local softirq disabling | Spin lock |
Interrupts + Deferrable functions | Local interrupt disabling | Spin lock |
Exceptions + Interrupts + Deferrable functions | Local interrupt disabling | Spin lock |
When a data structure is accessed only by exception handlers, race conditions are usually easy to understand and prevent. The most common exceptions that give rise to synchronization problems are the system call service routines (see the section "System Call Handler and Service Routines" in Chapter 10) in which the CPU operates in Kernel Mode to offer a service to a User Mode program. Thus, a data structure accessed only by an exception usually represents a resource that can be assigned to one or more processes.
Race conditions are avoided through semaphores, because these primitives allow the process to sleep until the resource becomes available. Notice that semaphores work equally well both in uniprocessor and multiprocessor systems.
Kernel preemption does not create problems either. If a process that owns a semaphore is preempted, a new process running on the same CPU could try to get the semaphore. When this occurs, the new process is put to sleep, and eventually the old process will release the semaphore. The only case in which kernel preemption must be explicitly disabled is when accessing per-CPU variables, as explained in the section "Per-CPU Variables" earlier in this chapter.
Suppose that a data structure is accessed by only the "top half" of an interrupt handler. We learned in the section "Interrupt Handling" in Chapter 4 that each interrupt handler is serialized with respect to itself — that is, it cannot execute more than once concurrently. Thus, accessing the data structure does not require synchronization primitives.
Things are different, however, if the data structure is accessed by several interrupt handlers. A handler may interrupt another handler, and different interrupt handlers may run concurrently in multiprocessor systems. Without synchronization, the shared data structure might easily become corrupted.
In uniprocessor systems, race conditions must be avoided by disabling interrupts in all critical regions of the interrupt handler. Nothing less will do because no other synchronization primitives accomplish the job. A semaphore can block the process, so it cannot be used in an interrupt handler. A spin lock, on the other hand, can freeze the system: if the handler accessing the data structure is interrupted, it cannot release the lock; therefore, the new interrupt handler keeps waiting on the tight loop of the spin lock.
Multiprocessor systems, as usual, are even more demanding. Race conditions cannot be avoided by simply disabling local interrupts. In fact, even if interrupts are disabled on a CPU, interrupt handlers can still be executed on the other CPUs. The most convenient method to prevent the race conditions is to disable local interrupts (so that other interrupt handlers running on the same CPU won't interfere) and to acquire a spin lock or a read/write spin lock that protects the data structure. Notice that these additional spin locks cannot freeze the system because even if an interrupt handler finds the lock closed, eventually the interrupt handler on the other CPU that owns the lock will release it.
The Linux kernel uses several macros that couple the enabling and disabling of local interrupts with spin lock handling. Table 5-9 describes all of them. In uniprocessor systems, these macros just enable or disable local interrupts and kernel preemption.
Table 5-9. Interrupt-aware spin lock macros
Macro | Description |
|---|---|
spin_lock_irq(l) | local_irq_disable(); spin_lock(l) |
spin_unlock_irq(l) | spin_unlock(l); local_irq_enable() |
spin_lock_bh(l) | local_bh_disable(); spin_lock(l) |
spin_unlock_bh(l) | spin_unlock(l); local_bh_enable() |
spin_lock_irqsave(l,f) | local_irq_save(f); spin_lock(l) |
spin_unlock_irqrestore(l,f) | spin_unlock(l); local_irq_restore(f) |
read_lock_irq(l) | local_irq_disable(); read_lock(l) |
read_unlock_irq(l) | read_unlock(l); local_irq_enable() |
read_lock_bh(l) | local_bh_disable(); read_lock(l) |
read_unlock_bh(l) | read_unlock(l); local_bh_enable() |
write_lock_irq(l) | local_irq_disable(); write_lock(l) |
write_unlock_irq(l) | write_unlock(l); local_irq_enable() |
write_lock_bh(l) | local_bh_disable(); write_lock(l) |
write_unlock_bh(l) | write_unlock(l); local_bh_enable() |
read_lock_irqsave(l,f) | local_irq_save(f); read_lock(l) |
read_unlock_irqrestore(l,f) | read_unlock(l); local_irq_restore(f) |
write_lock_irqsave(l,f) | local_irq_save(f); write_lock(l) |
write_unlock_irqrestore(l,f) | write_unlock(l); local_irq_restore(f) |
read_seqbegin_irqsave(l,f) | local_irq_save(f); read_seqbegin(l) |
read_seqretry_irqrestore(l,v,f) | read_seqretry(l,v); local_irq_restore(f) |
write_seqlock_irqsave(l,f) | local_irq_save(f); write_seqlock(l) |
write_sequnlock_irqrestore(l,f) | write_sequnlock(l); local_irq_restore(f) |
write_seqlock_irq(l) | local_irq_disable(); write_seqlock(l) |
write_sequnlock_irq(l) | write_sequnlock(l); local_irq_enable() |
write_seqlock_bh(l) | local_bh_disable(); write_seqlock(l) |
write_sequnlock_bh(l) | write_sequnlock(l); local_bh_enable() |
What kind of protection is required for a data structure accessed only by deferrable functions? Well, it mostly depends on the kind of deferrable function. In the section "Softirqs and Tasklets" in Chapter 4, we explained that softirqs and tasklets essentially differ in their degree of concurrency.
First of all, no race condition may exist in uniprocessor systems. This is because execution of deferrable functions is always serialized on a CPU — that is, a deferrable function cannot be interrupted by another deferrable function. Therefore, no synchronization primitive is ever required.
Conversely, in multiprocessor systems, race conditions do exist because several deferrable functions may run concurrently. Table 5-10 lists all possible cases.
Table 5-10. Protection required by data structures accessed by deferrable functions in SMP
Deferrable functions accessing the data structure | Protection |
|---|---|
Softirqs | Spin lock |
One tasklet | None |
Many tasklets | Spin lock |
A data structure accessed by a softirq must always be protected, usually by means of a spin lock, because the same softirq may run concurrently on two or more CPUs. Conversely, a data structure accessed by just one kind of tasklet need not be protected, because tasklets of the same kind cannot run concurrently. However, if the data structure is accessed by several kinds of tasklets, then it must be protected.
Let's consider now a data structure that is accessed both by exceptions (for instance, system call service routines) and interrupt handlers.
On uniprocessor systems, race condition prevention is quite simple, because interrupt handlers are not reentrant and cannot be interrupted by exceptions. As long as the kernel accesses the data structure with local interrupts disabled, the kernel cannot be interrupted when accessing the data structure. However, if the data structure is accessed by just one kind of interrupt handler, the interrupt handler can freely access the data structure without disabling local interrupts.
On multiprocessor systems, we have to take care of concurrent executions of exceptions and interrupts on other CPUs. Local interrupt disabling must be coupled with a spin lock, which forces the concurrent kernel control paths to wait until the handler accessing the data structure finishes its work.
Sometimes it might be preferable to replace the spin lock with
a semaphore. Because interrupt handlers cannot be suspended, they
must acquire the semaphore using a tight loop and the down_trylock( ) function; for them, the
semaphore acts essentially as a spin lock. System call service
routines, on the other hand, may suspend the calling processes when
the semaphore is busy. For most system calls, this is the expected
behavior. In this case, semaphores are preferable to spin locks,
because they lead to a higher degree of concurrency of the
system.
A data structure accessed both by exception handlers and deferrable functions can be treated like a data structure accessed by exception and interrupt handlers. In fact, deferrable functions are essentially activated by interrupt occurrences, and no exception can be raised while a deferrable function is running. Coupling local interrupt disabling with a spin lock is therefore sufficient.
Actually, this is much more than sufficient: the exception
handler can simply disable deferrable functions instead of local
interrupts by using the local_bh_disable(
) macro (see the section "Softirqs" in Chapter 4). Disabling only the
deferrable functions is preferable to disabling interrupts, because
interrupts continue to be serviced by the CPU. Execution of
deferrable functions on each CPU is serialized, so no race condition
exists.
As usual, in multiprocessor systems, spin locks are required to ensure that the data structure is accessed at any time by just one kernel control path.
This case is similar to that of a data structure accessed by interrupt and exception handlers. An interrupt might be raised while a deferrable function is running, but no deferrable function can stop an interrupt handler. Therefore, race conditions must be avoided by disabling local interrupts during the deferrable function. However, an interrupt handler can freely touch the data structure accessed by the deferrable function without disabling interrupts, provided that no other interrupt handler accesses that data structure.
Again, in multiprocessor systems, a spin lock is always required to forbid concurrent accesses to the data structure on several CPUs.
Similarly to previous cases, disabling local interrupts and acquiring a spin lock is almost always necessary to avoid race conditions. Notice that there is no need to explicitly disable deferrable functions, because they are essentially activated when terminating the execution of interrupt handlers; disabling local interrupts is therefore sufficient.
Kernel developers are expected to identify and solve the synchronization problems raised by interleaving kernel control paths. However, avoiding race conditions is a hard task because it requires a clear understanding of how the various components of the kernel interact. To give a feeling of what's really inside the kernel code, let's mention a few typical usages of the synchronization primitives defined in this chapter.
Reference counters are widely used inside the kernel to avoid
race conditions due to the concurrent allocation and releasing of a
resource. A reference counter is just an atomic_t counter associated with a specific
resource such as a memory page, a module, or a file. The counter is
atomically increased when a kernel control path starts using the
resource, and it is decreased when a kernel control path finishes
using the resource. When the reference counter becomes zero, the
resource is not being used, and it can be released if
necessary.
In earlier Linux kernel versions, a big kernel lock (also known as global kernel lock, or BKL) was widely used. In Linux 2.0, this lock was a relatively crude spin lock, ensuring that only one processor at a time could run in Kernel Mode. The 2.2 and 2.4 kernels were considerably more flexible and no longer relied on a single spin lock; rather, a large number of kernel data structures were protected by many different spin locks. In Linux kernel version 2.6, the big kernel lock is used to protect old code (mostly functions related to the VFS and to several filesystems).
Starting from kernel version 2.6.11, the big kernel lock is
implemented by a semaphore named kernel_sem (in earlier 2.6 versions, the big
kernel lock was implemented by means of a spin lock). The big kernel
lock is slightly more sophisticated than a simple semaphore,
however.
Every process descriptor includes a lock_depth field, which allows the same
process to acquire the big kernel lock several times. Therefore, two
consecutive requests for it will not hang the processor (as for normal
locks). If the process has not acquired the lock, the field has the
value -1; otherwise, the field value plus 1 specifies how many times
the lock has been taken. The lock_depth field is crucial for allowing
interrupt handlers, exception handlers, and deferrable functions to
take the big kernel lock: without it, every asynchronous function that
tries to get the big kernel lock could generate a deadlock if the
current process already owns the lock.
The lock_kernel( ) and
unlock_kernel( ) functions are used
to get and release the big kernel lock. The former function is
equivalent to:
depth = current->lock_depth + 1;
if (depth == 0)
down(&kernel_sem);
current->lock_depth = depth;
while the latter is equivalent to:
if (--current->lock_depth < 0)
up(&kernel_sem);
Notice that the if statements
of the lock_kernel( ) and unlock_kernel( ) functions need not be
executed atomically because lock_depth is not a global variable — each
CPU addresses a field of its own current process descriptor. Local
interrupts inside the if statements
do not induce race conditions either. Even if the new kernel control
path invokes lock_kernel( ), it
must release the big kernel lock before terminating.
Surprisingly enough, a process holding the big kernel lock is
allowed to invoke schedule( ), thus
relinquishing the CPU. The schedule(
) function, however, checks the lock_depth field of the process being
replaced and, if its value is zero or positive, automatically releases
the kernel_sem semaphore (see the
section "The schedule( )
Function" in Chapter
7). Thus, no process that explicitly invokes schedule( ) can keep the big kernel lock
across the process switch. The schedule(
) function, however, will reacquire the big kernel lock for
the process when it will be selected again for execution.
Things are different, however, if a process that holds the big kernel lock is preempted by another process. Up to kernel version 2.6.10 this case could not occur, because acquiring a spin lock automatically disables kernel preemption. The current implementation of the big kernel lock, however, is based on a semaphore, and acquiring it does not automatically disable kernel preemption. Actually, allowing kernel preemption inside critical regions protected by the big kernel lock has been the main reason for changing its implementation. This, in turn, has beneficial effects on the response time of the system.
When a process holding the big kernel lock is preempted,
schedule( ) must not release the
semaphore because the process executing the code in the critical
region has not voluntarily triggered a process switch, thus if the big
kernel lock would be released, another process might take it and
corrupt the data structures accessed by the preempted process.
To avoid the preempted process losing the big kernel lock, the
preempt_schedule_irq( ) function
temporarily sets the lock_depth
field of the process to -1 (see the
section "Returning from
Interrupts and Exceptions" in Chapter 4). Looking at the value
of this field, schedule( ) assumes
that the process being replaced does not hold the kernel_sem semaphore and thus does not
release it. As a result, the kernel_sem semaphore continues to be owned
by the preempted process. Once this process is selected again by the
scheduler, the preempt_schedule_irq(
) function restores the original value of the lock_depth field and lets the process resume
execution in the critical section protected by the big kernel
lock.
Each memory descriptor of type mm_struct includes its own semaphore in the
mmap_sem field (see the section
"The Memory
Descriptor" in Chapter
9). The semaphore protects the descriptor against race
conditions that could arise because a memory descriptor can be shared
among several lightweight processes.
For instance, let's suppose that the kernel must create or
extend a memory region for some process; to do this, it invokes the
do_mmap( ) function, which
allocates a new vm_area_struct data
structure. In doing so, the current process could be suspended if no
free memory is available, and another process sharing the same memory
descriptor could run. Without the semaphore, every operation of the
second process that requires access to the memory descriptor (for
instance, a Page Fault due to a Copy on Write) could lead to severe data
corruption.
The semaphore is implemented as a read/write semaphore, because some kernel functions, such as the Page Fault exception handler (see the section "Page Fault Exception Handler" in Chapter 9), need only to scan the memory descriptors.
The list of slab cache descriptors (see the section "Cache Descriptor" in
Chapter 8) is protected by
the cache_chain_sem semaphore,
which grants an exclusive right to access and modify the list.
A race condition is possible when kmem_cache_create( ) adds a new element in
the list, while kmem_cache_shrink(
) and kmem_cache_reap( )
sequentially scan the list. However, these functions are never invoked
while handling an interrupt, and they can never block while accessing
the list. The semaphore plays an active role both in multiprocessor
systems and in uniprocessor systems with kernel preemption
supported.
As we'll see in "Inode Objects" in Chapter 12, Linux stores the
information on a disk file in a memory object called an inode. The
corresponding data structure includes its own semaphore in the
i_sem field.
A huge number of race conditions can occur during filesystem handling. Indeed, each file on disk is a resource held in common for all users, because all processes may (potentially) access the file content, change its name or location, destroy or duplicate it, and so on. For example, let's suppose that a process lists the files contained in some directory. Each disk operation is potentially blocking, and therefore even in uniprocessor systems, other processes could access the same directory and modify its content while the first process is in the middle of the listing operation. Or, again, two different processes could modify the same directory at the same time. All these race conditions are avoided by protecting the directory file with the inode semaphore.
Whenever a program uses two or more semaphores, the potential
for deadlock is present, because two different paths could end up
waiting for each other to release a semaphore. Generally speaking,
Linux has few problems with deadlocks on semaphore requests, because
each kernel control path usually needs to acquire just one semaphore
at a time. However, in some cases, the kernel must get two or more
locks. Inode semaphores are prone to this scenario; for instance, this
occurs in the service routine in the rename(
) system call. In this case, two different inodes are
involved in the operation, so both semaphores must be taken. To avoid
such deadlocks, semaphore requests are performed in predefined address
order.
Countless computerized activities are driven by timing measurements , often behind the user's back. For instance, if the screen is automatically switched off after you have stopped using the computer's console, it is due to a timer that allows the kernel to keep track of how much time has elapsed since you pushed a key or moved the mouse. If you receive a warning from the system asking you to remove a set of unused files, it is the outcome of a program that identifies all user files that have not been accessed for a long time. To do these things, programs must be able to retrieve a timestamp identifying its last access time from each file. Such a timestamp must be automatically written by the kernel. More significantly, timing drives process switches along with even more visible kernel activities such as checking for time-outs.
We can distinguish two main kinds of timing measurement that must be performed by the Linux kernel:
Keeping the current time and date so they can be returned to
user programs through the time( ),
ftime( ), and gettimeofday( ) APIs (see the section "The time( ) and gettimeofday( )
System Calls" later in this chapter) and used by the kernel
itself as timestamps for files and network packets
Maintaining timers — mechanisms that are able to notify the kernel (see the later section "Software Timers and Delay Functions") or a user program (see the later sections "The setitimer( ) and alarm( ) System Calls" and "System Calls for POSIX Timers") that a certain interval of time has elapsed
Timing measurements are performed by several hardware circuits based on fixed-frequency oscillators and counters. This chapter consists of four different parts. The first two sections describe the hardware devices that underlie timing and give an overall picture of the Linux timekeeping architecture. The following sections describe the main time-related duties of the kernel: implementing CPU time sharing, updating system time and resource usage statistics, and maintaining software timers. The last section discusses the system calls related to timing measurements and the corresponding service routines.
On the 80×86 architecture, the kernel must explicitly interact with several kinds of clocks and timer circuits . The clock circuits are used both to keep track of the current time of day and to make precise time measurements. The timer circuits are programmed by the kernel, so that they issue interrupts at a fixed, predefined frequency; such periodic interrupts are crucial for implementing the software timers used by the kernel and the user programs. We'll now briefly describe the clock and hardware circuits that can be found in IBM-compatible PCs.
All PCs include a clock called Real Time Clock (RTC), which is independent of the CPU and all other chips.
The RTC continues to tick even when the PC is switched off, because it is energized by a small battery. The CMOS RAM and RTC are integrated in a single chip (the Motorola 146818 or an equivalent).
The RTC is capable of issuing periodic interrupts on IRQ 8 at frequencies ranging between 2 Hz and 8,192 Hz. It can also be programmed to activate the IRQ 8 line when the RTC reaches a specific value, thus working as an alarm clock.
Linux uses the RTC only to derive the time and date; however, it
allows processes to program the RTC by acting on the /dev/rtc device file (see Chapter 13). The kernel accesses
the RTC through the 0x70 and
0x71 I/O ports. The system
administrator can read and write the RTC by executing the clock Unix system program that acts
directly on these two I/O ports.
All 80×86 microprocessors include a CLK input pin, which
receives the clock signal of an external oscillator. Starting with the
Pentium, 80×86 microprocessors sport a counter that is increased at
each clock signal. The counter is accessible through the 64-bit
Time Stamp Counter(TSC)
register, which can be read by means of the rdtsc assembly language instruction. When using this
register, the kernel has to take into consideration the frequency of
the clock signal: if, for instance, the clock ticks at 1 GHz, the Time
Stamp Counter is increased once every nanosecond.
Linux may take advantage of this register to get much more accurate time measurements than those delivered by the Programmable Interval Timer. To do this, Linux must determine the clock signal frequency while initializing the system. In fact, because this frequency is not declared when compiling the kernel, the same kernel image may run on CPUs whose clocks may tick at any frequency.
The task of figuring out the actual frequency of a CPU is
accomplished during the system's boot. The calibrate_tsc( ) function computes the
frequency by counting the number of clock signals that occur in a time
interval of approximately 5 milliseconds. This time constant is
produced by properly setting up one of the channels of the
Programmable Interval Timer (see the next section).[*]
Besides the Real Time Clock and the Time Stamp Counter,
IBM-compatible PCs include another type of time-measuring device
called Programmable Interval
Timer (PIT). The role of a PIT is
similar to the alarm clock of a microwave oven: it makes the user
aware that the cooking time interval has elapsed. Instead of ringing a
bell, this device issues a special interrupt called timer
interrupt, which notifies the kernel that one more time
interval has elapsed.[†] Another difference from the alarm clock is that the PIT
goes on issuing interrupts forever at some fixed frequency established
by the kernel. Each IBM-compatible PC includes at least one PIT, which
is usually implemented by an 8254 CMOS chip using the 0x40-0x43
I/O ports.
As we'll see in detail in the next paragraphs, Linux programs
the PIT of IBM-compatible PCs to issue timer interrupts on the IRQ 0 at a (roughly) 1000-Hz frequency — that
is, once every 1 millisecond. This time interval is called a
tick, and its length in nanoseconds is stored in
the tick_nsec variable. On a PC,
tick_nsec is initialized to 999,848
nanoseconds (yielding a clock signal frequency of about 1000.15 Hz),
but its value may be automatically adjusted by the kernel if the
computer is synchronized with an external clock (see the later section
"The adjtimex( ) System
Call"). The ticks beat time for all activities in the system; in some
sense, they are like the ticks sounded by a metronome while a musician
is rehearsing.
Generally speaking, shorter ticks result in higher resolution
timers, which help with smoother multimedia playback and faster
response time when performing synchronous I/O multiplexing (poll( ) and select( )
system calls). This is a trade-off, however: shorter
ticks require the CPU to spend a larger fraction of its time in Kernel
Mode — that is, a smaller fraction of time in User Mode. As a
consequence, user programs run slower.
The frequency of timer interrupts depends on the hardware architecture. The slower machines have a tick of roughly 10 milliseconds (100 timer interrupts per second), while the faster ones have a tick of roughly 1 millisecond (1000 or 1024 timer interrupts per second).
A few macros in the Linux code yield some constants that determine the frequency of timer interrupts. These are discussed in the following list.
HZ yields the approximate
number of timer interrupts per second — that is, their frequency.
This value is set to 1000 for IBM PCs.
CLOCK_TICK_RATE yields
the value 1,193,182, which is the 8254 chip's internal oscillator
frequency.
LATCH yields the ratio
between CLOCK_TICK_RATE and
HZ, rounded to the nearest
integer. It is used to program the PIT.
The PIT is initialized by setup_pit_timer( ) as follows:
    spin_lock_irqsave(&i8253_lock, flags);
    outb_p(0x34,0x43);
    udelay(10);
    outb_p(LATCH & 0xff, 0x40);
    udelay(10);
    outb(LATCH >> 8, 0x40);
    spin_unlock_irqrestore(&i8253_lock, flags);
The outb( ) C function is
equivalent to the outb assembly
language instruction: it copies the first operand into the I/O port
specified as the second operand. The outb_p(
) function is similar to outb(
), except that it introduces a pause by executing a no-op
instruction to keep the hardware from getting confused. The udelay() macro introduces a further small
delay (see the later section "Delay Functions"). The
first outb_p( ) invocation is a
command to the PIT to issue interrupts at a new rate. The next two
outb_p( ) and outb( ) invocations supply the new interrupt
rate to the device. The 16-bit LATCH constant is sent to the 8-bit 0x40 I/O port of the device as two
consecutive bytes. As a result, the PIT issues timer interrupts at a
(roughly) 1000-Hz frequency (that is, once every 1 ms).
The local APIC present in recent 80 × 86 microprocessors (see the section "Interrupts and Exceptions" in Chapter 4) provides yet another time-measuring device: the CPU local timer .
The CPU local timer is a device similar to the Programmable Interval Timer just described that can issue one-shot or periodic interrupts. There are, however, a few differences:
The APIC's timer counter is 32 bits long, while the PIT's timer counter is 16 bits long; therefore, the local timer can be programmed to issue interrupts at very low frequencies (the counter stores the number of ticks that must elapse before the interrupt is issued).
The local APIC timer sends an interrupt only to its processor, while the PIT raises a global interrupt, which may be handled by any CPU in the system.
The APIC's timer is based on the bus clock signal (or the APIC bus signal, in older machines). It can be programmed in such a way as to decrease the timer counter every 1, 2, 4, 8, 16, 32, 64, or 128 bus clock signals. Conversely, the PIT, which makes use of its own clock signals, can be programmed in a more flexible way.
The High Precision Event Timer (HPET) is a new timer chip developed jointly by Intel and Microsoft. Although HPETs are not yet very common in end-user machines, Linux 2.6 already supports them, so we'll spend a few words describing their characteristics.
The HPET provides a number of hardware timers that can be exploited by the kernel. Basically, the chip includes up to eight 32-bit or 64-bit independent counters. Each counter is driven by its own clock signal, whose frequency must be at least 10 MHz; therefore, the counter is increased at least once every 100 nanoseconds. Each counter is associated with at most 32 timers, each of which is composed of a comparator and a match register. The comparator is a circuit that checks the value in the counter against the value in the match register, and raises a hardware interrupt if a match is found. Some of the timers can be enabled to generate a periodic interrupt.
The HPET chip can be programmed through registers mapped into memory space (much like the I/O APIC). The BIOS establishes the mapping during the bootstrapping phase and reports to the operating system kernel its initial memory address. The HPET registers allow the kernel to read and write the values of the counters and of the match registers , to program one-shot interrupts, and to enable or disable periodic interrupts on the timers that support them.
The next generation of motherboards will likely sport both the HPET and the 8254 PIT; at some future time, however, the HPET is expected to completely replace the PIT.
The ACPI Power Management Timer (or ACPI PMT) is yet another clock device included in almost all ACPI-based motherboards. Its clock signal has a fixed frequency of roughly 3.58 MHz. The device is actually a simple counter increased at each clock tick; to read the current value of the counter, the kernel accesses an I/O port whose address is determined by the BIOS during the initialization phase (see Appendix A).
The ACPI Power Management Timer is preferable to the TSC if the operating system or the BIOS may dynamically lower the frequency or voltage of the CPU to save battery power. When this happens, the frequency of the TSC changes—thus causing time warps and other unpleasant effects—while the frequency of the ACPI PMT does not. On the other hand, the high frequency of the TSC counter is quite handy for measuring very small time intervals.
However, if an HPET device is present, it should always be preferred to the other circuits because of its richer architecture. Table 6-2 later in this chapter illustrates how Linux takes advantage of the available timing circuits.
Now that we understand what the hardware timers are, we may discuss how the Linux kernel exploits them to conduct all activities of the system.
Linux must carry on several time-related activities. For instance, the kernel periodically:
Updates the time elapsed since system startup.
Updates the time and date.
Determines, for every CPU, how long the current process has been running, and preempts it if it has exceeded the time allocated to it. The allocation of time slots (also called "quanta") is discussed in Chapter 7.
Updates resource usage statistics.
Checks whether the interval of time associated with each software timer (see the later section "Software Timers and Delay Functions") has elapsed.
Linux's timekeeping architecture is the set of kernel data structures and functions related to the flow of time. Actually, 80 × 86-based multiprocessor machines have a timekeeping architecture that is slightly different from the timekeeping architecture of uniprocessor machines:
In a uniprocessor system, all time-keeping activities are triggered by interrupts raised by the global timer (either the Programmable Interval Timer or the High Precision Event Timer).
In a multiprocessor system, all general activities (such as handling of software timers) are triggered by the interrupts raised by the global timer, while CPU-specific activities (such as monitoring the execution time of the currently running process) are triggered by the interrupts raised by the local APIC timer.
Unfortunately, the distinction between the two cases is somewhat blurred. For instance, some early SMP systems based on Intel 80486 processors didn't have local APICs. Even nowadays, there are SMP motherboards so buggy that local timer interrupts are not usable at all. In these cases, the SMP kernel must resort to the UP timekeeping architecture. On the other hand, recent uniprocessor systems feature one local APIC, so the UP kernel often makes use of the SMP timekeeping architecture. However, to simplify our description, we won't discuss these hybrid cases and will stick to the two "pure" timekeeping architectures.
Linux's timekeeping architecture depends also on the availability of the Time Stamp Counter (TSC), of the ACPI Power Management Timer, and of the High Precision Event Timer (HPET). The kernel uses two basic timekeeping functions: one to keep the current time up-to-date and another to count the number of nanoseconds that have elapsed within the current second. There are different ways to get the latter value. Some methods are more precise and are available if the CPU has a Time Stamp Counter or an HPET; a less-precise method is used otherwise (see the later section "The time( ) and gettimeofday( ) System Calls").
The timekeeping architecture of Linux 2.6 makes use of a large number of data structures. As usual, we will describe the most important variables by referring to the 80 × 86 architecture.
In order to handle the possible timer sources in a uniform
way, the kernel makes use of a "timer object," which is a descriptor
of type timer_opts consisting of
the timer name and of four standard methods shown in Table 6-1.
Table 6-1. The fields of the timer_opts data structure
| Field name | Description |
|---|---|
| name | A string identifying the timer source |
| mark_offset | Records the exact time of the last tick; it is invoked by the timer interrupt handler |
| get_offset | Returns the time elapsed since the last tick |
| monotonic_clock | Returns the number of nanoseconds since the kernel initialization |
| delay | Waits for a given number of "loops" (see the later section "Delay Functions") |
The most important methods of the timer object are mark_offset and get_offset. The mark_offset method is invoked by the timer
interrupt handler, and records in a suitable data structure the
exact time at which the tick occurred. Using the saved value, the
get_offset method computes the
time in microseconds elapsed since the last timer interrupt (tick).
Thanks to these two methods, Linux timekeeping architecture achieves
a sub-tick resolution—that is, the kernel is able to determine the
current time with a precision much higher than the tick duration.
This operation is called time interpolation
.
The cur_timer variable
stores the address of the timer object corresponding to the "best"
timer source available in the system. Initially, cur_timer points to timer_none, which is the object
corresponding to a dummy timer source used when the kernel is being
initialized. During kernel initialization, the select_timer( ) function sets cur_timer to the address of the
appropriate timer object. Table 6-2 shows the most
common timer objects used in the 80×86 architecture, in order of
preference. As you see, select_timer(
) selects the HPET, if available; otherwise, it selects
the ACPI Power Management Timer, if available, or the TSC. As the last resort,
select_timer( ) selects the
always-present PIT. The "Time interpolation" column lists the timer
sources used by the mark_offset
and get_offset methods of the
timer object; the "Delay" column lists the timer sources used by the
delay method.
Table 6-2. Typical timer objects of the 80x86 architecture, in order of preference
| Timer object name | Description | Time interpolation | Delay |
|---|---|---|---|
| timer_hpet | High Precision Event Timer (HPET) | HPET | HPET |
| timer_pmtmr | ACPI Power Management Timer (ACPI PMT) | ACPI PMT | TSC |
| timer_tsc | Time Stamp Counter (TSC) | TSC | TSC |
| timer_pit | Programmable Interval Timer (PIT) | PIT | Tight loop |
| timer_none | Generic dummy timer source (used during kernel initialization) | (none) | Tight loop |
Notice that local APIC timers do not have a corresponding timer object. The reason is that local APIC timers are used only to generate periodic interrupts and are never used to achieve sub-tick resolution.
The jiffies
variable is a counter that stores the number of elapsed ticks since
the system was started. It is increased by one when a timer
interrupt occurs—that is, on every tick. In the 80 × 86
architecture, jiffies is a 32-bit
variable, therefore it wraps around in approximately 50 days—a
relatively short time interval for a Linux server. However, the
kernel handles cleanly the overflow of jiffies thanks to the time_after, time_after_eq, time_before, and time_before_eq macros: they yield the
correct value even if a wraparound occurred.
You might suppose that jiffies is initialized to zero at system
startup. Actually, this is not the case: jiffies is initialized to 0xfffb6c20, which corresponds to the
32-bit signed value -300,000; therefore, the counter will overflow
five minutes after the system boot. This is done on purpose, so that
buggy kernel code that does not check for the overflow of jiffies shows up very soon during the
development phase and does not pass unnoticed in stable
kernels.
In a few cases, however, the kernel needs the real number of
system ticks elapsed since the system boot, regardless of the
overflows of jiffies. Therefore,
in the 80 × 86 architecture the jiffies variable is equated by the linker
to the 32 less significant bits of a 64-bit counter called jiffies_64. With a tick of 1 millisecond,
the jiffies_64 variable wraps
around in several hundreds of millions of years, thus we can safely
assume that it never overflows.
You might wonder why jiffies has not been directly declared as
a 64-bit unsigned long long
integer on the 80 × 86 architecture. The answer is that accesses to
64-bit variables in 32-bit architectures cannot be done atomically.
Therefore, every read operation on the whole 64 bits requires some
synchronization technique to ensure that the counter is not updated
while the two 32-bit half-counters are read; as a consequence, every
64-bit read operation is significantly slower than a 32-bit read
operation.
The get_jiffies_64( )
function reads the value of jiffies_64 and returns its value:
unsigned long long get_jiffies_64(void)
{
    unsigned long seq;
    unsigned long long ret;
    do {
        seq = read_seqbegin(&xtime_lock);
        ret = jiffies_64;
    } while (read_seqretry(&xtime_lock, seq));
    return ret;
}
The 64-bit read operation is protected by the xtime_lock seqlock (see the section "Seqlocks" in Chapter 5): the function keeps
reading the jiffies_64 variable
until it knows for sure that it has not been concurrently updated by
another kernel control path.
Conversely, the critical region increasing the jiffies_64 variable must be protected by
means of write_seqlock(&xtime_lock
) and write_sequnlock(
&xtime_lock). Notice that the ++jiffies_64 instruction also increases
the 32-bit jiffies variable,
because the latter corresponds to the lower half of jiffies_64.
The xtime variable
stores the current time and date; it is a structure of type timespec having two fields:
tv_sec
Stores the number of seconds that have elapsed since midnight of January 1, 1970 (UTC)
tv_nsec
Stores the number of nanoseconds that have elapsed within the last second (its value ranges between 0 and 999,999,999)
The xtime variable is
usually updated once in a tick—that is, roughly 1000 times per
second. As we'll see in the later section "System Calls Related to Timing
Measurements," user programs get the current time and date
from the xtime variable. The
kernel also often refers to it, for instance, when updating inode
timestamps (see the section "File Descriptor and
Inode" in Chapter
1).
The xtime_lock seqlock
avoids the race conditions that could occur due to concurrent
accesses to the xtime variable.
Remember that xtime_lock also
protects the jiffies_64 variable;
in general, this seqlock is used to define several critical regions
of the timekeeping architecture.
In a uniprocessor system, all time-related activities are triggered by the interrupts raised by the Programmable Interval Timer on IRQ line 0. As usual, in Linux, some of these activities are executed as soon as possible right after the interrupt is raised, while the remaining activities are carried on by deferrable functions (see the later section "Dynamic Timers").
During kernel initialization, the time_init( ) function is invoked to set up
the timekeeping architecture. It usually[*] performs the following operations:
Initializes the xtime
variable. The number of seconds elapsed since the midnight of
January 1, 1970 is read from the Real Time Clock by means of the
get_cmos_time( ) function.
The tv_nsec field of xtime is set, so that the forthcoming
overflow of the jiffies
variable will coincide with an increment of the tv_sec field—that is, it will fall on
a second boundary.
Initializes the wall_to_monotonic variable. This
variable is of the same type timespec as xtime, and it essentially stores the
number of seconds and nanoseconds to be added to xtime in order to get a monotonic
(ever increasing) flow of time. In fact, both leap seconds and
synchronization with external clocks might suddenly change the
tv_sec and tv_nsec fields of xtime so that they are no longer
monotonically increased. As we'll see in the later section
"System Calls for
POSIX Timers," sometimes the kernel needs a truly
monotonic time source.
If the kernel supports HPET, it invokes the hpet_enable( ) function to determine
whether the ACPI firmware has probed the chip and mapped its
registers in the memory address space. In the affirmative case,
hpet_enable( ) programs the
first timer of the HPET chip so that it raises the IRQ 0
interrupt 1000 times per second. Otherwise, if the HPET chip is
not available, the kernel will use the PIT: the chip has already
been programmed by the init_IRQ(
) function to raise 1000 timer interrupts per second,
as described in the earlier section "Programmable Interval
Timer (PIT)."
Invokes select_timer( )
to select the best timer source available in the system, and
sets the cur_timer variable
to the address of the corresponding timer object.
Invokes setup_irq(0, &irq0) to set up the interrupt gate corresponding to IRQ 0—the line associated with the system timer interrupt source (PIT or HPET). The irq0 variable is statically defined as:

struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT, 0,
                          "timer", NULL, NULL };

From now on, the timer_interrupt( ) function will be invoked once every tick with interrupts disabled, because the status field of IRQ 0's main descriptor has the SA_INTERRUPT flag set.
The timer_interrupt(
) function is the interrupt service routine (ISR) of the
PIT or of the HPET; it performs the following steps:
Protects the time-related kernel variables by issuing a
write_seqlock() on the
xtime_lock seqlock (see the
section "Seqlocks" in Chapter 5).
Executes the mark_offset method of the cur_timer timer object. As explained
in the earlier section "Data Structures of the
Timekeeping Architecture," there are four possible
cases:
cur_timer points to
the timer_hpet object: in
this case, the HPET chip is the source of timer interrupts.
The mark_offset method
checks that no timer interrupt has been lost since the last
tick; in this unlikely case, it updates jiffies_64 accordingly. Next, the
method records the current value of the periodic HPET
counter.
cur_timer points to
the timer_pmtmr object:
in this case, the PIT chip is the source of timer
interrupts, but the kernel uses the ACPI Power Management
Timer to measure time with a finer resolution. The mark_offset method checks that no
timer interrupt has been lost since the last tick and
updates jiffies_64 if
necessary. Then, it records the current value of the ACPI
Power Management Timer counter.
cur_timer points to
the timer_tsc object: in
this case, the PIT chip is the source of timer interrupts,
but the kernel uses the Time Stamp Counter to measure time
with a finer resolution. The mark_offset method performs the
same operations as in the previous case: it checks that no
timer interrupt has been lost since the last tick and
updates jiffies_64 if
necessary. Then, it records the current value of the TSC
counter.
cur_timer points to
the timer_pit object: in
this case, the PIT chip is the source of timer interrupts,
and there is no other timer circuit. The mark_offset method does
nothing.
Invokes the do_timer_interrupt(
) function, which in turn performs the following
actions:
Increases by one the value of jiffies_64. Notice that this can
be done safely, because the kernel control path still holds
the xtime_lock seqlock
for writing.
Invokes the update_times(
) function to update the system date and time and
to compute the current system load; these activities are
discussed later in the sections "Updating the Time and
Date" and "Updating System Statistics."
Invokes the update_process_times( ) function
to perform several time-related accounting operations for
the local CPU (see the section "Updating Local CPU
Statistics" later in this chapter).
Invokes the profile_tick(
) function (see the section "Profiling the Kernel
Code" later in this chapter).
If the system clock is synchronized with an external
clock (an adjtimex( )
system call has been previously issued), invokes the
set_rtc_mmss( ) function
once every 660 seconds (every 11 minutes) to adjust the Real
Time Clock. This feature helps systems on a network
synchronize their clocks (see the later section "The adjtimex( ) System
Call").
Releases the xtime_lock
seqlock by invoking write_sequnlock().
Returns the value 1 to notify that the interrupt has been effectively handled (see the section "I/O Interrupt Handling" in Chapter 4).
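The seqlock discipline that protects jiffies_64 in the steps above can be restated as a small user-space sketch. This is a single-threaded illustration of the protocol only — the real xtime_lock involves a spin lock and memory barriers, and every sim_* name below is ours, not the kernel's:

```c
#include <assert.h>

/* Minimal seqlock-style protection for a 64-bit counter, in the spirit
 * of how xtime_lock guards jiffies_64. The writer makes the sequence
 * number odd while updating; readers retry until they see the same
 * even sequence number before and after reading the value. */
static unsigned int sim_seq;
static unsigned long long sim_jiffies_64;

static void sim_write_tick(void)
{
    sim_seq++;                 /* odd: writer in progress */
    sim_jiffies_64++;
    sim_seq++;                 /* even again: writer done */
}

static unsigned long long sim_read_jiffies_64(void)
{
    unsigned int start;
    unsigned long long v;
    do {
        start = sim_seq;       /* retry while a writer is active or */
        v = sim_jiffies_64;    /* a write completed during the read */
    } while ((start & 1) || sim_seq != start);
    return v;
}
```

Because the timer interrupt handler already holds the write side when do_timer_interrupt( ) runs, the increment of jiffies_64 needs no further protection.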
Multiprocessor systems can rely on two different sources of timer interrupts: those raised by the Programmable Interval Timer or the High Precision Event Timer, and those raised by the CPU local timers.
In Linux 2.6, global timer interrupts—raised by the PIT or the HPET—signal activities not related to a specific CPU, such as handling of software timers and keeping the system time up-to-date. Conversely, a CPU local timer interrupt signals timekeeping activities related to the local CPU, such as monitoring how long the current process has been running and updating the resource usage statistics.
The global timer interrupt handler is initialized by the
time_init( ) function, which has
already been described in the earlier section "Timekeeping Architecture in
Uniprocessor Systems."
The Linux kernel reserves the interrupt vector 239 (0xef) for local timer interrupts (see
Table 4-2 in Chapter 4). During kernel
initialization, the apic_intr_init(
) function sets up the IDT's interrupt gate corresponding
to vector 239 with the address of the low-level interrupt handler
apic_timer_interrupt( ).
Moreover, each APIC has to be told how often to generate a local
time interrupt. The calibrate_APIC_clock(
) function computes how many bus clock signals are
received by the local APIC of the booting CPU during a tick (1 ms).
This exact value is then used to program the local APICs in such a
way to generate one local timer interrupt every tick. This is done
by the setup_APIC_timer( )
function, which is executed once for every CPU in the system.
All local APIC timers are synchronized because they are based
on the common bus clock signal. This means that the value computed
by calibrate_APIC_clock( ) for
the boot CPU is also good for the other CPUs in the system.
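The calibration boils down to simple arithmetic: the number of bus clock signals per tick is the bus frequency divided by the tick rate. A sketch, with an assumed 100 MHz bus clock purely for the example:

```c
#include <assert.h>

/* Illustrative arithmetic behind calibrate_APIC_clock(): how many bus
 * clock signals the local APIC counts during one tick. The 100 MHz
 * figure used in the test is an assumption, not a hardware fact. */
static unsigned long sim_apic_counts_per_tick(unsigned long bus_hz,
                                              unsigned long hz)
{
    return bus_hz / hz;    /* bus signals per tick of length 1/hz s */
}
```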
The SMP version of the timer_interrupt() handler differs from the
UP version in a few points:
The do_timer_interrupt(
) function, invoked by timer_interrupt( ), writes into a port
of the I/O APIC chip to acknowledge the timer IRQ.
The update_process_times(
) function is not invoked, because this function
performs actions related to a specific CPU.
The profile_tick( )
function is not invoked, because this function also performs
actions related to a specific CPU.
This handler performs the timekeeping activities related to a specific CPU in the system, namely profiling the kernel code and checking how long the current process has been running on a given CPU.
The apic_timer_interrupt( )
assembly language function is equivalent to the following
code:
apic_timer_interrupt:
pushl $(239-256)
SAVE_ALL
movl %esp, %eax
call smp_apic_timer_interrupt
jmp ret_from_intr
As you can see, the low-level handler is very similar to the
other low-level interrupt handlers already described in Chapter 4. The high-level
interrupt handler called smp_apic_timer_interrupt( ) executes the
following steps:
Gets the CPU logical number (say, n).
Increases the apic_timer_irqs field of the
n th entry of
the irq_stat array (see the
section "Checking
the NMI Watchdogs" later in this chapter).
Acknowledges the interrupt on the local APIC.
Calls the irq_enter( )
function (see the section "The do_IRQ( )
function" in Chapter
4).
Invokes the smp_local_timer_interrupt( )
function.
Calls the irq_exit( )
function.
The smp_local_timer_interrupt(
) function executes the per-CPU timekeeping activities.
Actually, it performs the following main steps:
Invokes the profile_tick(
) function (see the section "Profiling the Kernel
Code" later in this chapter).
Invokes the update_process_times(
) function to check how long the current process has
been running and to update some local CPU statistics (see the
section "Updating
Local CPU Statistics" later in this chapter).
The system administrator can change the sample frequency of
the kernel code profiler by writing into the /proc/profile file. To carry out the
change, the kernel modifies the frequency at which local timer
interrupts are generated. However, the smp_local_timer_interrupt( ) function
keeps invoking the update_process_times(
) function exactly once every tick.
[*] The time_init( )
function is executed before mem_init(
), which initializes the memory data structures.
Unfortunately, the HPET registers are memory mapped, therefore
initialization of the HPET chip has to be done after the
execution of mem_init( ).
Linux 2.6 adopts a cumbersome solution: if the kernel supports
the HPET chip, the time_init(
) function limits itself to trigger the activation of
the hpet_time_init( )
function. The latter function is executed after mem_init( ) and performs the
operations described in this section.
User programs get the current time and date from the
xtime variable. The kernel must
periodically update this variable, so that its value is always
reasonably accurate.
The update_times( ) function,
which is invoked by the global timer interrupt handler, updates the
value of the xtime variable as
follows:
void update_times(void)
{
unsigned long ticks;
ticks = jiffies - wall_jiffies;
if (ticks) {
wall_jiffies += ticks;
update_wall_time(ticks);
}
calc_load(ticks);
}
We recall from the previous description of the timer interrupt
handler that when the code of this function is executed, the xtime_lock seqlock has already been acquired
for writing.
The wall_jiffies variable
stores the time of the last update of the xtime variable. Observe that the value of
wall_jiffies can be smaller than
jiffies-1, since a few timer
interrupts can be lost, for instance when interrupts remain disabled for
a long period of time; in other words, the kernel does not necessarily
update the xtime variable at every
tick. However, no tick is definitively lost, and in the long run,
xtime stores the correct system time.
The check for lost timer interrupts is done in the mark_offset method of cur_timer; see the earlier section "Timekeeping Architecture in
Uniprocessor Systems."
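The lost-tick accounting just described can be restated as a user-space sketch. The sim_ names are ours, and xtime is reduced here to a bare nanosecond counter:

```c
#include <assert.h>

/* Sketch of update_times(): wall_jiffies records the tick at which
 * xtime was last updated, so jiffies - wall_jiffies gives the number
 * of ticks (possibly more than one, if timer interrupts were delayed)
 * that must be folded into the time now. No tick is ever lost. */
static unsigned long jiffies_sim;       /* stand-in for jiffies      */
static unsigned long wall_jiffies_sim;  /* stand-in for wall_jiffies */
static unsigned long long xtime_nsec;   /* grossly simplified xtime  */

static void sim_update_wall_time(unsigned long ticks)
{
    xtime_nsec += ticks * 1000000ULL;   /* one tick = 1,000,000 ns (HZ=1000) */
}

static void sim_update_times(void)
{
    unsigned long ticks = jiffies_sim - wall_jiffies_sim;
    if (ticks) {
        wall_jiffies_sim += ticks;
        sim_update_wall_time(ticks);
    }
}
```

Even if sim_update_times( ) runs only after five ticks have elapsed, all five are accounted for in one pass.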
The update_wall_time( )
function invokes the update_wall_time_one_tick(
) function ticks
consecutive times; normally, each invocation adds 1,000,000 to the
xtime.tv_nsec field. If the value of
xtime.tv_nsec becomes greater than
999,999,999, the update_wall_time( )
function also updates the tv_sec
field of xtime. If an adjtimex( ) system call has been issued, for
reasons explained in the section "The adjtimex( ) System
Call" later in this chapter, the function might tune the value
1,000,000 slightly so the clock speeds up or slows down a little.
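The nanosecond-to-second carry performed by update_wall_time( ) amounts to the following sketch, where the structure is merely a stand-in for the kernel's xtime:

```c
#include <assert.h>

/* Each tick adds roughly 1,000,000 ns; when tv_nsec would exceed
 * 999,999,999, the excess is carried into tv_sec. The slight tuning
 * of the per-tick amount done for adjtimex( ) is omitted here. */
struct sim_timespec {
    long tv_sec;
    long tv_nsec;
};

static void sim_wall_time_ticks(struct sim_timespec *t, unsigned long ticks)
{
    while (ticks--) {
        t->tv_nsec += 1000000;          /* one tick worth of nanoseconds */
        if (t->tv_nsec > 999999999) {   /* carry into the seconds field  */
            t->tv_nsec -= 1000000000;
            t->tv_sec++;
        }
    }
}
```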
The calc_load( ) function is
described in the section "Keeping Track of System
Load" later in this chapter.
The kernel, among the other time-related duties, must periodically collect various data used for:
Checking the CPU resource limit of the running processes
Updating statistics about the local CPU workload
Computing the average system load
Profiling the kernel code
We have mentioned that the update_process_times( ) function is
invoked—either by the global timer interrupt handler on uniprocessor
systems or by the local timer interrupt handler in multiprocessor
systems—to update some kernel statistics. This function performs the
following steps:
Checks how long the current process has been running.
Depending on whether the current process was running in User Mode
or in Kernel Mode when the timer interrupt occurred, invokes
either account_user_time( ) or
account_system_time( ). Each of
these functions performs essentially the following steps:
Updates either the utime field (ticks spent in User
Mode) or the stime field
(ticks spent in Kernel Mode) of the current process
descriptor. Two additional fields called cutime and cstime are provided in the process
descriptor to count the number of CPU ticks spent by the
process children in User Mode and Kernel Mode, respectively.
For reasons of efficiency, these fields are not updated by
update_process_times( ),
but rather when the parent process queries the state of one of
its children (see the section "Destroying
Processes" in Chapter 3).
Checks whether the total CPU time limit has been
reached; if so, sends SIGXCPU and SIGKILL signals to current. The section "Process Resource
Limits" in Chapter
3 describes how the limit is controlled by the signal->rlim[RLIMIT_CPU].rlim_cur
field of each process descriptor.
Invokes account_it_virt(
) and account_it_prof(
) to check the process timers (see the section
"The setitimer( )
and alarm( ) System Calls" later in this
chapter).
Updates some kernel statistics stored in the kstat per-CPU variable.
Invokes raise_softirq( )
to activate the TIMER_SOFTIRQ
tasklet on the local CPU (see the section "Software Timers and Delay
Functions" later in this chapter).
If some old version of an RCU-protected data structure has
to be reclaimed, checks whether the local CPU has gone through a
quiescent state and invokes tasklet_schedule( ) to activate the
rcu_tasklet tasklet of the
local CPU (see the section "Read-Copy Update
(RCU)" in Chapter
5).
Invokes the scheduler_tick(
) function, which decreases the time slice counter of
the current process, and checks whether its quantum is exhausted.
We'll discuss in depth these operations in the section "The scheduler_tick( )
Function" in Chapter
7.
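The dispatch at the heart of the first step above — charging the tick to either utime or stime — can be sketched as follows. The struct is a tiny stand-in for the process-descriptor fields just mentioned:

```c
#include <assert.h>

/* Sketch of the per-tick CPU time accounting: the tick is charged to
 * utime or stime depending on the mode the CPU was in when the timer
 * interrupt arrived. Resource-limit checks are omitted. */
struct sim_task {
    unsigned long utime;   /* ticks spent in User Mode   */
    unsigned long stime;   /* ticks spent in Kernel Mode */
};

static void sim_account_tick(struct sim_task *p, int user_mode)
{
    if (user_mode)
        p->utime++;        /* corresponds to account_user_time( )   */
    else
        p->stime++;        /* corresponds to account_system_time( ) */
}
```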
Every Unix kernel keeps track of how much CPU activity
is being carried on by the system. These statistics are used by
various administration utilities such as top. A user who enters the uptime command sees the statistics as the
"load average" relative to the last minute, the last 5 minutes, and
the last 15 minutes. On a uniprocessor system, a value of 0 means that
there are no active processes (besides the
swapper process 0) to run, while a value of 1
means that the CPU is 100 percent busy with a single process, and
values greater than 1 mean that the CPU is shared among several active
processes.[*]
At every tick, update_times(
) invokes the calc_load(
) function, which counts the number of processes in the
TASK_RUNNING or TASK_UNINTERRUPTIBLE state and uses this
number to update the average system load.
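The averaging itself is an exponentially-decaying mean computed in fixed-point arithmetic. The sketch below uses the constants found in the Linux sources (11 fractional bits; EXP_1 = 1884, which is roughly 1/e^(5s/1min) in fixed point, matching a 5-second sampling interval); treat it as an illustration of the idea rather than the exact kernel code:

```c
#include <assert.h>

/* Fixed-point exponentially-decaying load average, kernel style:
 * new = old * e + active * (1 - e), with 11 fractional bits. */
#define SIM_FSHIFT  11
#define SIM_FIXED_1 (1 << SIM_FSHIFT)   /* 1.0 in fixed point */
#define SIM_EXP_1   1884                /* decay factor for the 1-min average */

static unsigned long sim_calc_load(unsigned long load, unsigned long exp,
                                   unsigned long active)
{
    load *= exp;
    load += active * (SIM_FIXED_1 - exp);
    return load >> SIM_FSHIFT;
}
```

A sustained run queue of one task drives the average asymptotically toward 1.0 (that is, toward SIM_FIXED_1 in fixed point).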
Linux includes a minimalist code profiler called readprofile used by Linux developers to discover where the kernel spends its time in Kernel Mode. The profiler identifies the hot spots of the kernel — the most frequently executed fragments of kernel code. Identifying the kernel hot spots is very important, because they may point out kernel functions that should be further optimized.
The profiler is based on a simple Monte Carlo algorithm: at
every timer interrupt occurrence, the kernel determines whether the
interrupt occurred in Kernel Mode; if so, the kernel fetches the value
of the eip register before the
interruption from the stack and uses it to discover what the kernel
was doing before the interrupt. In the long run, the samples
accumulate on the hot spots.
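The sampling step is nothing more than a histogram update: the saved eip is turned into a bucket index and that bucket's counter is increased. In the sketch below, the bucket size mirrors the profile=N boot parameter, but the text-start address, bucket count, and all sim_ names are assumptions of ours:

```c
#include <assert.h>

/* Sketch of the profiler's per-tick sampling: buckets of size
 * 2^prof_shift over the kernel text, one counter per bucket. */
#define SIM_TEXT_START 0xc0100000UL   /* assumed start of kernel text  */
#define SIM_PROF_SHIFT 2              /* profile=2: 4-byte buckets     */
#define SIM_NBUCKETS   1024

static unsigned int sim_profile[SIM_NBUCKETS];

static void sim_profile_tick(unsigned long eip)
{
    unsigned long i = (eip - SIM_TEXT_START) >> SIM_PROF_SHIFT;
    if (i < SIM_NBUCKETS)             /* ignore samples outside the range */
        sim_profile[i]++;
}
```

In the long run, the buckets with the highest counters correspond to the hot spots.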
The profile_tick( ) function
collects the data for the code profiler. It is invoked either by the
do_timer_interrupt( ) function in
uniprocessor systems (by the global timer interrupt handler) or by the
smp_local_timer_interrupt( )
function in multiprocessor systems (by the local timer interrupt
handler).
To enable the code profiler, the Linux kernel must be booted by
passing as a parameter the string profile=N, where
2^N denotes the size of the code fragments
to be profiled. The collected data can be read from the /proc/profile file. The counters are reset
by writing in the same file; in multiprocessor systems, writing into
the file can also change the sample frequency (see the earlier section
"Timekeeping Architecture
in Multiprocessor Systems"). However, kernel developers do not
usually access /proc/profile
directly; instead, they use the readprofile system command.
The Linux 2.6 kernel includes yet another profiler called
oprofile. Besides being more flexible and
customizable than readprofile,
oprofile can be used to discover hot spots in
kernel code, User Mode applications, and system libraries. When
oprofile is being used, profile_tick( ) invokes the timer_notify( ) function to collect the data
used by this new profiler.
In multiprocessor systems, Linux offers yet another
feature to kernel developers: a watchdog system
, which might be quite useful to detect kernel bugs
that cause a system freeze. To activate such a watchdog, the kernel
must be booted with the nmi_watchdog parameter.
The watchdog is based on a clever hardware feature of local and
I/O APICs: they can generate periodic NMI interrupts on every CPU. Because NMI interrupts are not masked by
the cli assembly language instruction, the watchdog can detect
deadlocks even when interrupts are disabled.
As a consequence, once every tick, all CPUs, regardless of what
they are doing, start executing the NMI interrupt handler; in turn,
the handler invokes do_nmi( ). This
function gets the logical number n of the CPU,
and then checks the apic_timer_irqs
field of the n th
entry of irq_stat (see Table 4-8 in Chapter 4). If the CPU is working
properly, the value must be different from the value read at the
previous NMI interrupt. When the CPU is running properly, the
n th entry of the
apic_timer_irqs field is increased
by the local timer interrupt handler (see the earlier section "The local timer interrupt
handler"); if the counter is not increased, the local timer
interrupt handler has not been executed in a whole tick. Not a good
thing, you know.
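The watchdog's test can be sketched as follows. The snapshot-and-compare logic mirrors the description above, but the names and the alert threshold (five consecutive stalled NMIs) are ours, chosen only to make the example concrete:

```c
#include <assert.h>

/* Sketch of the NMI watchdog check: compare the per-CPU count of
 * local timer interrupts with the snapshot taken at a previous NMI;
 * if it has not moved for several NMIs in a row, report a freeze. */
static unsigned int sim_apic_timer_irqs;   /* bumped by the timer handler */
static unsigned int sim_last_seen;
static unsigned int sim_stuck_nmis;

static int sim_nmi_check(void)             /* returns 1 on suspected freeze */
{
    if (sim_apic_timer_irqs == sim_last_seen) {
        if (++sim_stuck_nmis >= 5)         /* assumed threshold */
            return 1;
    } else {
        sim_last_seen = sim_apic_timer_irqs;
        sim_stuck_nmis = 0;
    }
    return 0;
}
```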
When the NMI interrupt handler detects a CPU freeze, it rings all the bells: it logs scary messages in the system logfiles, dumps the contents of the CPU registers and of the kernel stack (kernel oops), and finally kills the current process. This gives kernel developers a chance to discover what's gone wrong.
[*] Linux includes in the load average all processes that are in
the TASK_RUNNING and TASK_UNINTERRUPTIBLE states. However,
under normal conditions, there are few TASK_UNINTERRUPTIBLE processes, so a
high load usually means that the CPU is busy.
A timer is a software facility that allows functions to be invoked at some future moment, after a given time interval has elapsed; a time-out denotes a moment at which the time interval associated with a timer has elapsed.
Timers are widely used both by the kernel and by processes. Most device drivers use timers to detect anomalous conditions — floppy disk drivers, for instance, use timers to switch off the device motor after the floppy has not been accessed for a while, and parallel printer drivers use them to detect erroneous printer conditions.
Timers are also used quite often by programmers to force the execution of specific functions at some future time (see the later section "The setitimer( ) and alarm( ) System Calls").
Implementing a timer is relatively easy. Each timer contains a
field that indicates how far in the future the timer should expire. This
field is initially calculated by adding the right number of ticks to the
current value of jiffies. The field
does not change. Every time the kernel checks a timer, it compares the
expiration field to the value of jiffies at the current moment, and the timer
expires when jiffies is greater than
or equal to the stored value.
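A subtlety hides in "greater than or equal to": the tick counter eventually wraps around, so a naive comparison would misfire near the wraparound point. The kernel's time_after_eq( ) idea solves this with signed arithmetic on the difference, as this user-space restatement shows:

```c
#include <assert.h>

/* Wraparound-safe version of "jiffies >= expires": cast the unsigned
 * difference to a signed value, so that values on opposite sides of
 * the wraparound point still compare correctly (as long as they are
 * less than half the counter range apart). */
static int sim_timer_expired(unsigned long jiffies_now, unsigned long expires)
{
    return (long)(jiffies_now - expires) >= 0;
}
```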
Linux considers two types of timers called dynamic timers and interval timers . The first type is used by the kernel, while interval timers may be created by processes in User Mode.
One word of caution about Linux timers: since checking for timer functions is always done by deferrable functions that may be executed a long time after they have been activated, the kernel cannot ensure that timer functions will start right at their expiration times. It can only ensure that they are executed either at the proper time or, at worst, with a delay of up to a few hundred milliseconds. For this reason, timers are not appropriate for real-time applications in which expiration times must be strictly enforced.
Besides software timers , the kernel also makes use of delay functions , which execute a tight instruction loop until a given time interval elapses. We will discuss them in the later section "Delay Functions."
Dynamic timers may be dynamically created and destroyed. No limit is placed on the number of currently active dynamic timers.
A dynamic timer is stored in the following timer_list structure:
struct timer_list {
struct list_head entry;
unsigned long expires;
spinlock_t lock;
unsigned long magic;
void (*function)(unsigned long);
unsigned long data;
tvec_base_t *base;
};
The function field contains
the address of the function to be executed when the timer expires. The
data field specifies a parameter to
be passed to this timer function. Thanks to the data field, it is possible to define a
single general-purpose function that handles the time-outs of several
device drivers; the data field
could store the device ID or other meaningful data that could be used
by the function to differentiate the device.
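The function/data pair is the key to this multiplexing. The following sketch shows one timer function serving several floppy-like devices through the data field; the mini struct mirrors timer_list, and all names are ours:

```c
#include <assert.h>

/* Sketch of a general-purpose timer function: the callback receives
 * the device ID stored in the data field and acts on that device. */
struct sim_timer {
    unsigned long expires;
    void (*function)(unsigned long);
    unsigned long data;            /* e.g., a device ID */
};

static int sim_motor_off[4];

static void sim_floppy_timeout(unsigned long dev_id)
{
    sim_motor_off[dev_id] = 1;     /* switch off that device's motor */
}

static void sim_run_timer(struct sim_timer *t, unsigned long now)
{
    if ((long)(now - t->expires) >= 0)   /* wraparound-safe expiry test */
        t->function(t->data);
}
```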
The expires field specifies
when the timer expires; the time is expressed as the number of ticks
that have elapsed since the system started up. All timers that have an
expires value smaller than or equal
to the value of jiffies are
considered to be expired or decayed.
The entry field is used to
insert the software timer into one of the doubly linked circular lists
that group together the timers according to the value of their
expires field. The algorithm that
uses these lists is described later in this chapter.
To create and activate a dynamic timer, the kernel must:
Create, if necessary, a new timer_list object — for example,
t. This can be done in several
ways by:
Defining a static global variable in the code.
Defining a local variable inside a function; in this case, the object is stored on the Kernel Mode stack.
Including the object in a dynamically allocated descriptor.
Initialize the object by invoking the init_timer(&t) function. This
essentially sets the t.base
pointer field to NULL and sets
the t.lock spin lock to
"open."
Load the function field
with the address of the function to be activated when the timer
decays. If required, load the data field with a parameter value to be
passed to the function.
If the dynamic timer is not already inserted in a list,
assign a proper value to the expires field and invoke the add_timer(&t) function to insert the
t element in the proper
list.
Otherwise, if the dynamic timer is already inserted in a
list, update the expires field
by invoking the mod_timer( )
function, which also takes care of moving the object into the
proper list (discussed next).
Once the timer has decayed, the kernel automatically removes the
t element from its list. Sometimes,
however, a process should explicitly remove a timer from its list
using the del_timer( ), del_timer_sync( ), or del_singleshot_timer_sync( ) functions.
Indeed, a sleeping process may be woken up before the time-out is
over; in this case, the process may choose to destroy the timer.
Invoking del_timer( ) on a timer
already removed from a list does no harm, so removing the timer within
the timer function is considered a good practice.
In Linux 2.6, a dynamic timer is bound to the CPU that activated
it—that is, the timer function will always run on the same CPU that
first executed the add_timer( ) or
later the mod_timer( ) function.
The del_timer( ) and companion
functions, however, can deactivate every dynamic timer, even if it is
not bound to the local CPU.
Being asynchronously activated, dynamic timers are prone to race conditions. For instance, consider a dynamic timer whose function acts on a discardable resource (e.g., a kernel module or a file data structure). Releasing the resource without stopping the timer may lead to data corruption if the timer function got activated when the resource no longer exists. Thus, a rule of thumb is to stop the timer before releasing the resource:
...
del_timer(&t);
X_Release_Resources();
...
In multiprocessor systems, however, this code is not safe
because the timer function might already be running on another CPU
when del_timer( ) is invoked. As
a result, resources may be released while the timer function is
still acting on them. To avoid this kind of race condition, the
kernel offers the del_timer_sync(
) function. It removes the timer from the list, and then
it checks whether the timer function is executed on another CPU; in
such a case, del_timer_sync( )
waits until the timer function terminates.
The del_timer_sync( )
function is rather complex and slow, because it has to carefully
take into consideration the case in which the timer function
reactivates itself. If the kernel developer knows that the timer
function never reactivates the timer, she can use the simpler and
faster del_singleshot_timer_sync(
) function to deactivate a timer and wait until the timer
function terminates.
Other types of race conditions exist, of course. For instance,
the right way to modify the expires field of an already activated
timer consists of using mod_timer(
), rather than deleting the timer and re-creating it
thereafter. In the latter approach, two kernel control paths that
want to modify the expires field
of the same timer may mix each other up badly. The implementation of
the timer functions is made SMP-safe by means of the lock spin lock included in every timer_list object: every time the kernel
must access a dynamic timer, it disables the interrupts and acquires
this spin lock.
Choosing the proper data structure to implement dynamic timers is not easy. Stringing together all timers in a single list would degrade system performance, because scanning a long list of timers at every tick is costly. On the other hand, maintaining a sorted list would not be much more efficient, because the insertion and deletion operations would also be costly.
The adopted solution is based on a clever data structure that
partitions the expires values
into blocks of ticks and allows dynamic timers to percolate
efficiently from lists with larger expires values to lists with smaller ones.
Moreover, in multiprocessor systems the set of active dynamic timers
is split among the various CPUs.
The main data structure for dynamic timers is a per-CPU
variable (see the section "Per-CPU Variables" in
Chapter 5) named tvec_bases: it includes NR_CPUS elements, one for each CPU in the
system. Each element is a tvec_base_t structure, which includes all
data needed to handle the dynamic timers bound to the corresponding
CPU:
typedef struct tvec_t_base_s {
spinlock_t lock;
unsigned long timer_jiffies;
struct timer_list *running_timer;
tvec_root_t tv1;
tvec_t tv2;
tvec_t tv3;
tvec_t tv4;
tvec_t tv5;
} tvec_base_t;
The tv1 field is a
structure of type tvec_root_t,
which includes a vec array of 256
list_head elements — that is,
lists of dynamic timers. It contains all dynamic timers, if any,
that will decay within the next 255 ticks.
The tv2, tv3, and tv4 fields are structures of type tvec_t consisting of a vec array of 64 list_head elements. These lists contain
all dynamic timers that will decay within the next
2^14-1, 2^20-1,
and 2^26-1 ticks, respectively.
The tv5 field is identical
to the previous ones, except that the last entry of the vec array is a list that includes dynamic
timers with extremely large expires fields. It never needs to be
replenished from another array. Figure 6-1 illustrates in
a schematic way the five groups of lists.
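The partitioning just described can be sketched with a small helper. This is illustration only, not kernel code; the range bounds simply restate the tick counts quoted above:

```c
#include <assert.h>

/* Illustration only (not kernel code): given a timer's expires value and
 * the base's timer_jiffies, decide which tv group the timer would land in,
 * following the ranges described above: tv1 covers deltas below 2^8 ticks,
 * tv2 below 2^14, tv3 below 2^20, tv4 below 2^26, and tv5 the rest. */
static int tv_group(unsigned long expires, unsigned long timer_jiffies)
{
    unsigned long delta = expires - timer_jiffies;

    if (delta < (1UL << 8))  return 1;  /* indexed by bits 0-7 of expires */
    if (delta < (1UL << 14)) return 2;  /* indexed by bits 8-13  */
    if (delta < (1UL << 20)) return 3;  /* indexed by bits 14-19 */
    if (delta < (1UL << 26)) return 4;  /* indexed by bits 20-25 */
    return 5;                           /* extremely large expires values */
}
```

A timer whose delta shrinks over time migrates toward lower-numbered groups, which is exactly the percolation the cascade mechanism implements.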
The timer_jiffies field
represents the earliest expiration time of the dynamic timers yet to
be checked: if it coincides with the value of jiffies, no backlog of deferrable
functions has accumulated; if it is smaller than jiffies, then lists of dynamic timers that
refer to previous ticks must be dealt with. The field is set to
jiffies at system startup and is
increased only by the run_timer_softirq(
) function described in the next section. Notice that the
timer_jiffies field might drop a
long way behind jiffies when the
deferrable functions that handle dynamic timers are not executed for
a long time—for instance because these functions have been disabled
or because a large number of interrupt handlers have been
executed.
In multiprocessor systems, the running_timer field points to the timer_list structure of the dynamic timer
that is currently handled by the local CPU.
Despite the clever data structures, handling software
timers is a time-consuming activity that should not be performed by
the timer interrupt handler. In Linux 2.6 this activity is carried
on by a deferrable function, namely the TIMER_SOFTIRQ softirq.
The run_timer_softirq( )
function is the deferrable function associated with the TIMER_SOFTIRQ softirq. It essentially
performs the following actions:
Stores in the base
local variable the address of the tvec_base_t data structure associated
with the local CPU.
Acquires the base->lock spin lock and disables
local interrupts.
Starts a while loop,
which ends when base->timer_jiffies becomes greater
than the value of jiffies. In
every single execution of the cycle, performs the following
substeps:
Computes the index of the list in base->tv1 that holds the next
timers to be handled:
index = base->timer_jiffies & 255;
If index is zero,
all lists in base->tv1
have been checked, so they are empty: the function therefore
percolates the dynamic timers by invoking cascade( ):
if (!index &&
(!cascade(base, &base->tv2, (base->timer_jiffies>> 8)&63)) &&
(!cascade(base, &base->tv3, (base->timer_jiffies>>14)&63)) &&
(!cascade(base, &base->tv4, (base->timer_jiffies>>20)&63)))
cascade(base, &base->tv5, (base->timer_jiffies>>26)&63);
Consider the first invocation of the cascade( ) function: it receives
as arguments the address in base, the address of base->tv2, and the index of the
list in base->tv2
including the timers that will decay in the next 256 ticks.
This index is determined by looking at the proper bits of
the base->timer_jiffies value.
cascade( ) moves all
dynamic timers in the base->tv2 list into the proper
lists of base->tv1;
then, it returns a positive value, unless all base->tv2 lists are now empty.
If so, cascade( ) is
invoked once more to replenish base->tv2 with the timers
included in a list of base->tv3, and so on.
Increases base->timer_jiffies by one.
For each dynamic timer in the base->tv1.vec[index] list,
executes the corresponding timer function. In particular,
for each timer_list
element t in the list
essentially performs the following steps:
Removes t from
the base->tv1 list.
In multiprocessor systems, sets base->running_timer to
&t.
Sets t.base to
NULL.
Releases the base->lock spin lock, and
enables local interrupts.
Executes the timer function t.function passing as argument
t.data.
Acquires the base->lock spin lock, and
disables local interrupts.
Continues with the next timer in the list, if any.
All timers in the list have been handled. Continues
with the next iteration of the outermost while cycle.
The outermost while
cycle is terminated, which means that all decayed timers have
been handled. In multiprocessor systems, sets base->running_timer to NULL.
Releases the base->lock spin lock and enables
local interrupts.
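The index computations in the steps above can be isolated into small helpers. The following sketch (illustration only) mirrors the bit manipulations in the code fragment quoted earlier: tv1 is indexed by the low 8 bits of timer_jiffies, and each cascade( ) level by the next 6-bit field:

```c
#include <assert.h>

/* Illustration only: the list indices that run_timer_softirq( ) derives
 * from base->timer_jiffies -- tv1 uses bits 0-7, while the cascade( )
 * calls on tv2..tv5 use successive 6-bit fields, exactly as in the
 * fragment quoted above. */
static int tv1_index(unsigned long tj) { return tj & 255; }
static int tv2_index(unsigned long tj) { return (tj >> 8) & 63; }
static int tv3_index(unsigned long tj) { return (tj >> 14) & 63; }
static int tv4_index(unsigned long tj) { return (tj >> 20) & 63; }
static int tv5_index(unsigned long tj) { return (tj >> 26) & 63; }
```

Note that tv1_index( ) wraps to zero every 256 ticks, which is precisely the condition that triggers the first cascade( ) call; tv2_index( ) in turn wraps every 2^14 ticks, and so on up the hierarchy.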
Because the values of jiffies and timer_jiffies usually coincide, the
outermost while cycle is often
executed only once. In general, the outermost loop is executed
jiffies - base->timer_jiffies +
1 consecutive times. Moreover, if a timer interrupt occurs
while run_timer_softirq( ) is
being executed, dynamic timers that decay at this tick occurrence
are also considered, because the jiffies variable is asynchronously
increased by the global timer interrupt handler (see the earlier
section "The timer
interrupt handler").
Notice that run_timer_softirq(
) disables interrupts and acquires the base->lock spin lock just before
entering the outermost loop; interrupts are enabled and the spin
lock is released right before invoking each dynamic timer function,
until its termination. This ensures that the dynamic timer data
structures are not corrupted by interleaved kernel control
paths.
To sum up, this rather complex algorithm ensures excellent
performance. To see why, assume for the sake of simplicity that the
TIMER_SOFTIRQ softirq is executed
right after the corresponding timer interrupt occurs. Then, in 255
timer interrupt occurrences out of 256 (in 99.6% of the cases), the
run_timer_softirq( ) function
just runs the functions of the decayed timers, if any. To replenish
base->tv1.vec periodically, it
is sufficient 63 times out of 64 to partition one list of base->tv2 into the 256 lists of
base->tv1. The base->tv2.vec array, in turn, must be
replenished in 0.006 percent of the cases (that is, once every 16.4
seconds). Similarly, base->tv3.vec is replenished every 17
minutes and 28 seconds, and base->tv4.vec is replenished every 18
hours and 38 minutes. base->tv5.vec doesn't need to be
replenished.
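The figures above follow directly from the widths of the index fields, assuming HZ=1000 so that one tick is one millisecond. A quick sanity check of the arithmetic:

```c
#include <assert.h>

/* Sanity check of the replenishment periods quoted above, assuming
 * HZ=1000 (one tick per millisecond). A tv level whose index field
 * starts at bit position `shift` wraps every 2^shift ticks; this helper
 * converts that wrap period into whole seconds. */
static unsigned long wrap_period_seconds(int shift)
{
    return (1UL << shift) / 1000;   /* ticks -> whole seconds */
}
```

2^14 ticks is 16.384 s (the quoted 16.4 s), 2^20 ticks is 1048 s (17 min 28 s), and 2^26 ticks is 67108 s (18 h 38 min).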
To show how the outcomes of all the previous activities are actually used in the kernel, we'll show an example of the creation and use of a process time-out.
Let's consider the service routine of the nanosleep() system call, that is, sys_nanosleep(), which receives as its
parameter a pointer to a timespec
structure and suspends the invoking process until the specified time
interval elapses. The service routine first invokes copy_from_user() to copy the values
contained in the User Mode timespec
structure into the local variable t. Assuming that the timespec structure defines a non-null delay,
the function then executes the following code:
current->state = TASK_INTERRUPTIBLE;
remaining = schedule_timeout(timespec_to_jiffies(&t)+1);
The timespec_to_jiffies( )
function converts in ticks the time interval stored in the timespec structure. To be on the safe side,
sys_nanosleep( ) adds one tick to
the value computed by timespec_to_jiffies(
).
The kernel implements process time-outs by using dynamic timers. They appear in the schedule_timeout( ) function, which
essentially executes the following statements:
struct timer_list timer;
unsigned long expire = timeout + jiffies;
init_timer(&timer);
timer.expires = expire;
timer.data = (unsigned long) current;
timer.function = process_timeout;
add_timer(&timer);
schedule( ); /* process suspended until timer expires */
del_singleshot_timer_sync(&timer);
timeout = expire - jiffies;
return (timeout < 0 ? 0 : timeout);
When schedule( ) is invoked,
another process is selected for execution; when the former process
resumes its execution, the function removes the dynamic timer. In the
last statement, the function either returns 0, if the time-out is
expired, or the number of ticks left to the time-out expiration if the
process was awakened for some other reason.
When the time-out expires, the timer's function is executed:
void process_timeout(unsigned long __data)
{
wake_up_process((task_t *)__data);
}
The process_timeout( ) function
receives as its parameter the process descriptor pointer stored in the
data field of the timer object. As a result, the suspended
process is awakened.
Once awakened, the process continues the execution of the
sys_nanosleep( ) system call. If
the value returned by schedule_timeout(
) specifies that the process time-out is expired (value
zero), the system call terminates. Otherwise, the system call is
automatically restarted, as explained in the section "Reexecution of System
Calls" in Chapter
11.
Software timers are useless when the kernel must wait for a short time interval—let's say, less than a few milliseconds. For instance, often a device driver has to wait for a predefined number of microseconds until the hardware completes some operation. Because a dynamic timer has a significant setup overhead and a rather large minimum wait time (1 millisecond), the device driver cannot conveniently use it.
In these cases, the kernel makes use of the udelay( ) and ndelay( ) functions: the former receives as
its parameter a time interval in microseconds and returns after the
specified delay has elapsed; the latter is similar, but the argument
specifies the delay in nanoseconds.
Essentially, the two functions are defined as follows:
void udelay(unsigned long usecs)
{
unsigned long loops;
loops = (usecs*HZ*current_cpu_data.loops_per_jiffy)/1000000;
cur_timer->delay(loops);
}
void ndelay(unsigned long nsecs)
{
unsigned long loops;
loops = (nsecs*HZ*current_cpu_data.loops_per_jiffy)/1000000000;
cur_timer->delay(loops);
}
Both functions rely on the delay method of the cur_timer timer object (see the earlier
section "Data Structures
of the Timekeeping Architecture"), which receives as its
parameter a time interval in "loops." The exact duration of one
"loop," however, depends on the timer object referred by cur_timer (see Table 6-2 earlier in this
chapter):
If cur_timer points to
the timer_hpet, timer_pmtmr, or timer_tsc objects, one "loop"
corresponds to one CPU cycle—that is, the time interval between
two consecutive CPU clock signals (see the earlier section "Time Stamp Counter
(TSC)").
If cur_timer points to
the timer_none or timer_pit objects, one "loop"
corresponds to the time duration of a single iteration of a tight
instruction loop.
During the initialization phase, after cur_timer has been set up by select_timer( ), the kernel executes the
calibrate_delay( ) function, which
determines how many "loops" fit in a tick. This value is then saved in
the current_cpu_data.loops_per_jiffy variable,
so that it can be used by udelay( )
and ndelay( ) to convert
microseconds and nanoseconds, respectively, to "loops."
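The conversion performed by udelay( ) can be checked in isolation. In the sketch below the hz and loops_per_jiffy values are made-up illustrations, not measured ones:

```c
#include <assert.h>

/* The arithmetic used by udelay( ): convert a delay in microseconds to
 * "loops", given the tick frequency hz and the calibrated number of
 * loops per tick. The hz and loops_per_jiffy values passed in the tests
 * are purely illustrative assumptions. */
static unsigned long usecs_to_loops(unsigned long usecs, unsigned long hz,
                                    unsigned long loops_per_jiffy)
{
    return (usecs * hz * loops_per_jiffy) / 1000000;
}
```

As a consistency check, a delay of exactly one tick (1000000/hz microseconds) maps back to loops_per_jiffy "loops".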
Of course, the cur_timer->delay(
) method makes use of the HPET or TSC hardware circuitry, if
available, to get an accurate measurement of time. Otherwise, if no
HPET or TSC is available, the method executes loops iterations of a tight instruction
loop.
Several system calls allow User Mode processes to read and modify the time and date and to create timers. Let's briefly review these and discuss how the kernel handles them.
Processes in User Mode can get the current time and date by means of several system calls:
time( )
Returns the number of elapsed seconds since midnight at the start of January 1, 1970 (UTC).
gettimeofday( )
Returns, in a data structure named timeval, the number of elapsed seconds
since midnight of January 1, 1970 (UTC) and the number of
elapsed microseconds in the last second (a second data structure
named timezone is not
currently used).
The time( ) system call is
superseded by gettimeofday( ), but
it is still included in Linux for backward compatibility. Another
widely used function, ftime( ),
which is no longer implemented as a system call, returns the number of
elapsed seconds since midnight of January 1, 1970 (UTC) and the number
of elapsed milliseconds in the last second.
The gettimeofday( ) system
call is implemented by the sys_gettimeofday(
) function. To compute the current date and time of the day,
this function invokes do_gettimeofday(
), which executes the following actions:
Acquires the xtime_lock
seqlock for reading.
Determines the number of microseconds elapsed since the last
timer interrupt by invoking the get_offset method of the cur_timer timer object:
usec = cur_timer->getoffset( );
As explained in the earlier section "Data Structures of the Timekeeping Architecture," there are four possible cases:
If cur_timer points
to the timer_hpet object,
the method compares the current value of the HPET counter with
the value of the same counter saved in the last execution of
the timer interrupt handler.
If cur_timer points
to the timer_pmtmr object,
the method compares the current value of the ACPI PMT counter
with the value of the same counter saved in the last execution
of the timer interrupt handler.
If cur_timer points
to the timer_tsc object,
the method compares the current value of the Time Stamp
Counter with the value of the TSC saved in the last execution
of the timer interrupt handler.
If cur_timer points
to the timer_pit object,
the method reads the current value of the PIT counter to
compute the number of microseconds elapsed since the last
PIT's timer interrupt.
If some timer interrupt has been lost (see the section
"Updating the Time and
Date" earlier in this chapter), the function adds to
usec the corresponding
delay:
usec += (jiffies - wall_jiffies) * 1000;
Adds to usec the
microseconds elapsed in the last second:
usec += (xtime.tv_nsec / 1000);
Copies the contents of xtime into the user-space buffer
specified by the system call parameter tv, adding to the microseconds field the
value of usec:
tv->tv_sec = xtime->tv_sec;
tv->tv_usec = xtime->tv_usec + usec;
Invokes read_seqretry( )
on the xtime_lock seqlock, and
jumps back to step 1 if another kernel control path has
concurrently acquired xtime_lock for writing.
Checks for an overflow in the microseconds field, adjusting both that field and the second field if necessary:
while (tv->tv_usec >= 1000000) {
tv->tv_usec -= 1000000;
tv->tv_sec++;
}
Processes in User Mode with root privilege may modify the
current date and time by using either the obsolete stime( ) or the settimeofday( ) system call. The sys_settimeofday( ) function invokes
do_settimeofday( ), which executes
operations complementary to those of do_gettimeofday( ).
Notice that both system calls modify the value of xtime while leaving the RTC registers
unchanged. Therefore, the new time is lost when the system shuts down,
unless the user executes the clock program to change the RTC
value.
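From user space the result of all this bookkeeping is visible through the system call itself. A minimal check (illustration only) verifies the invariant that the overflow-adjustment step above enforces:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Read the current time via gettimeofday( ) and verify the invariant
 * that the overflow check in do_gettimeofday( ) enforces: the
 * microseconds field is always normalized into the range [0, 1000000). */
static int tv_usec_is_normalized(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return 0;
    return tv.tv_usec >= 0 && tv.tv_usec < 1000000 && tv.tv_sec > 0;
}
```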
Although clock drift ensures that all systems eventually
move away from the correct time, changing the time abruptly is both an
administrative nuisance and risky behavior. Imagine, for instance,
programmers trying to build a large program and depending on file
timestamps to make sure that out-of-date object files are recompiled.
A large change in the system's time could confuse the make program and lead to an incorrect build.
Keeping the clocks tuned is also important when implementing a
distributed filesystem on a network of computers. In this case, it is
wise to adjust the clocks of the interconnected PCs, so that the
timestamp values associated with the inodes of the accessed files are
coherent. Thus, systems are often configured to run a time
synchronization protocol such as Network Time Protocol (NTP) on a
regular basis to change the time gradually at each tick. This utility
depends on the adjtimex( ) system
call in Linux.
This system call is present in several Unix variants, although
it should not be used in programs intended to be portable. It receives
as its parameter a pointer to a timex structure, updates kernel parameters
from the values in the timex
fields, and returns the same structure with current kernel values.
Such kernel values are used by update_wall_time_one_tick( ) to slightly
adjust the number of microseconds added to xtime.tv_usec at each tick.
Linux allows User Mode processes to activate special timers called interval timers .[*] The timers cause Unix signals (see Chapter 11) to be sent periodically to the process. It is also possible to activate an interval timer so that it sends just one signal after a specified delay. Each interval timer is therefore characterized by:
The frequency at which the signals must be emitted, or a null value if just one signal has to be generated
The time remaining until the next signal is to be generated
The earlier warning about accuracy applies to these timers. They are guaranteed to execute after the requested time has elapsed, but it is impossible to predict exactly when they will be delivered.
Interval timers are activated by means of the POSIX setitimer( ) system call. The first
parameter specifies which of the following policies should be
adopted:
ITIMER_REAL
The actual elapsed time; the process receives SIGALRM signals.
ITIMER_VIRTUAL
The time spent by the process in User Mode; the process
receives SIGVTALRM
signals.
ITIMER_PROF
The time spent by the process both in User and in Kernel
Mode; the process receives SIGPROF signals.
The interval timers can be either single-shot or periodic. The
second parameter of setitimer( )
points to a structure of type itimerval that specifies the initial
duration of the timer (in seconds and nanoseconds) and the duration to
be used when the timer is automatically reactivated (or zero for
single-shot timers). The third parameter of setitimer( ) is an optional pointer to an
itimerval structure that is filled
by the system call with the previous timer parameters.
To implement an interval timer for each of the preceding policies, the process descriptor includes three pairs of fields:
it_real_incr and it_real_value
it_virt_incr and it_virt_value
it_prof_incr and it_prof_value
The first field of each pair stores the interval in ticks between two signals; the other field stores the current value of the timer.
The ITIMER_REAL interval
timer is implemented by using dynamic timers because the kernel must
send signals to the process even when it is not running on the CPU.
Therefore, each process descriptor includes a dynamic timer object
called real_timer. The setitimer( ) system call initializes the
real_timer fields and then invokes
add_timer( ) to insert the dynamic
timer in the proper list. When the timer expires, the kernel executes
the it_real_fn( ) timer function.
In turn, the it_real_fn( ) function
sends a SIGALRM signal to the
process; then, if it_real_incr is
not null, it sets the expires field
again, reactivating the timer.
The ITIMER_VIRTUAL and
ITIMER_PROF interval timers do not
require dynamic timers, because they can be updated while the process
is running. The account_it_virt( )
and account_it_prof( ) functions
are invoked by update_process_times(
), which is called either by the PIT's timer interrupt
handler (UP) or by the local timer interrupt handlers (SMP).
Therefore, the two interval timers are updated once every tick, and if
they are expired, the proper signal is sent to the current
process.
The alarm( ) system call
sends a SIGALRM signal to the
calling process when a specified time interval has elapsed. It is very
similar to setitimer( ) when
invoked with the ITIMER_REAL
parameter, because it uses the real_timer dynamic timer included in the
process descriptor. Therefore, alarm(
) and setitimer( ) with
parameter ITIMER_REAL cannot be
used at the same time.
The POSIX 1003.1b standard introduced a new type of software timers for User Mode programs—in particular, for multithreaded and real-time applications. These timers are often referred to as POSIX timers .
Every implementation of the POSIX timers must offer to the User Mode programs a few POSIX clocks , that is, virtual time sources having predefined resolutions and properties. Whenever an application wants to make use of a POSIX timer, it creates a new timer resource specifying one of the existing POSIX clocks as the timing base. The system calls that allow users to handle POSIX clocks and timers are listed in Table 6-3.
Table 6-3. System calls for POSIX timers and clocks
| System call | Description |
|---|---|
| clock_gettime( ) | Gets the current value of a POSIX clock |
| clock_settime( ) | Sets the current value of a POSIX clock |
| clock_getres( ) | Gets the resolution of a POSIX clock |
| timer_create( ) | Creates a new POSIX timer based on a specified POSIX clock |
| timer_gettime( ) | Gets the current value and increment of a POSIX timer |
| timer_settime( ) | Sets the current value and increment of a POSIX timer |
| timer_getoverrun( ) | Gets the number of overruns of a decayed POSIX timer |
| timer_delete( ) | Destroys a POSIX timer |
| clock_nanosleep( ) | Puts the process to sleep using a POSIX clock as time source |
The Linux 2.6 kernel offers two types of POSIX clocks:
CLOCK_REALTIME
This virtual clock represents the real-time clock of the
system—essentially the value of the xtime variable (see the earlier
section "Updating the
Time and Date"). The resolution returned by the clock_getres( ) system call is 999,848
nanoseconds, which corresponds to roughly 1000 updates of
xtime in a second.
CLOCK_MONOTONIC
This virtual clock represents the real-time clock of the
system purged of every time warp due to the synchronization with
an external time source. Essentially, this virtual clock is
represented by the sum of the two variables xtime and wall_to_monotonic (see the earlier
section "Timekeeping
Architecture in Uniprocessor Systems"). The resolution of
this POSIX clock, returned by clock_getres( ), is 999,848
nanoseconds.
The Linux kernel implements the POSIX timers by means of dynamic
timers. Thus, they are similar to the traditional ITIMER_REAL interval timers we described in
the previous section. POSIX timers, however, are much more flexible
and reliable than traditional interval timers. A couple of significant
differences between them are:
When a traditional interval timer decays, the kernel always
sends a SIGALRM signal to the
process that activated the timer. Instead, when a POSIX timer
decays, the kernel can send every kind of signal, either to the
whole multithreaded application or to a single specified thread.
The kernel can also force the execution of a notifier function in
a thread of the application, or it can even do nothing (it is up
to a User Mode library to handle the event).
If a traditional interval timer decays many times but the
User Mode process cannot receive the SIGALRM signal (for instance because the
signal is blocked or the process is not running), only the first
signal is received: all other occurrences of SIGALRM are lost. The same holds for
POSIX timers, but the process can invoke the timer_getoverrun( ) system call to get
the number of times the timer decayed since the generation of the
first signal.
Like every time sharing system, Linux achieves the magical effect of an apparent simultaneous execution of multiple processes by switching from one process to another in a very short time frame. Process switching itself was discussed in Chapter 3; this chapter deals with scheduling , which is concerned with when to switch and which process to choose.
The chapter consists of three parts. The section "Scheduling Policy" introduces the choices made by Linux in the abstract to schedule processes. The section "The Scheduling Algorithm" discusses the data structures used to implement scheduling and the corresponding algorithm. Finally, the section "System Calls Related to Scheduling" describes the system calls that affect process scheduling.
To simplify the description, we refer as usual to the 80 × 86 architecture; in particular, we assume that the system uses the Uniform Memory Access model, and that the system tick is set to 1 ms.
The scheduling algorithm of traditional Unix operating systems must fulfill several conflicting objectives: fast process response time, good throughput for background jobs, avoidance of process starvation, reconciliation of the needs of low- and high-priority processes, and so on. The set of rules used to determine when and how to select a new process to run is called the scheduling policy.
Linux scheduling is based on the time sharing technique: several processes run in "time multiplexing" because the CPU time is divided into slices, one for each runnable process.[*] Of course, a single processor can run only one process at any given instant. If a currently running process is not terminated when its time slice or quantum expires, a process switch may take place. Time sharing relies on timer interrupts and is thus transparent to processes. No additional code needs to be inserted in the programs to ensure CPU time sharing.
The scheduling policy is also based on ranking processes according to their priority. Complicated algorithms are sometimes used to derive the current priority of a process, but the end result is the same: each process is associated with a value that tells the scheduler how appropriate it is to let the process run on a CPU.
In Linux, process priority is dynamic. The scheduler keeps track of what processes are doing and adjusts their priorities periodically; in this way, processes that have been denied the use of a CPU for a long time interval are boosted by dynamically increasing their priority. Correspondingly, processes running for a long time are penalized by decreasing their priority.
When speaking about scheduling, processes are traditionally classified as I/O-bound or CPU-bound. The former make heavy use of I/O devices and spend much time waiting for I/O operations to complete; the latter carry on number-crunching applications that require a lot of CPU time.
An alternative classification distinguishes three classes of processes:
Interactive processes
These interact constantly with their users, and therefore spend a lot of time waiting for keypresses and mouse operations. When input is received, the process must be woken up quickly, or the user will find the system to be unresponsive. Typically, the average delay must fall between 50 and 150 milliseconds. The variance of such delay must also be bounded, or the user will find the system to be erratic. Typical interactive programs are command shells, text editors, and graphical applications.
Batch processes
These do not need user interaction, and hence they often run in the background. Because such processes do not need to be very responsive, they are often penalized by the scheduler. Typical batch programs are programming language compilers, database search engines, and scientific computations.
Real-time processes
These have very stringent scheduling requirements. Such processes should never be blocked by lower-priority processes and should have a short guaranteed response time with a minimum variance. Typical real-time programs are video and sound applications, robot controllers, and programs that collect data from physical sensors.
The two classifications we just offered are somewhat independent. For instance, a batch process can be either I/O-bound (e.g., a database server) or CPU-bound (e.g., an image-rendering program). While real-time programs are explicitly recognized as such by the scheduling algorithm in Linux, there is no easy way to distinguish between interactive and batch programs. The Linux 2.6 scheduler implements a sophisticated heuristic algorithm based on the past behavior of the processes to decide whether a given process should be considered as interactive or batch. Of course, the scheduler tends to favor interactive processes over batch ones.
Programmers may change the scheduling priorities by means of the system calls illustrated in Table 7-1. More details are given in the section "System Calls Related to Scheduling."
Table 7-1. System calls related to scheduling
| System call | Description |
|---|---|
| nice( ) | Change the static priority of a conventional process |
| getpriority( ) | Get the maximum static priority of a group of conventional processes |
| setpriority( ) | Set the static priority of a group of conventional processes |
| sched_getscheduler( ) | Get the scheduling policy of a process |
| sched_setscheduler( ) | Set the scheduling policy and the real-time priority of a process |
| sched_getparam( ) | Get the real-time priority of a process |
| sched_setparam( ) | Set the real-time priority of a process |
| sched_yield( ) | Relinquish the processor voluntarily without blocking |
| sched_get_priority_min( ) | Get the minimum real-time priority value for a policy |
| sched_get_priority_max( ) | Get the maximum real-time priority value for a policy |
| sched_rr_get_interval( ) | Get the time quantum value for the Round Robin policy |
| sched_setaffinity( ) | Set the CPU affinity mask of a process |
| sched_getaffinity( ) | Get the CPU affinity mask of a process |
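As a small illustration of the last rows of Table 7-1, the real-time priority range accepted for each policy can be queried without privileges. This is only a sketch for a Linux system; SCHED_OTHER is the user-space name corresponding to SCHED_NORMAL:

```c
#include <sched.h>
#include <stdio.h>

/* Print the real-time priority range accepted for each scheduling policy.
 * On Linux, SCHED_FIFO and SCHED_RR report 1..99, while SCHED_OTHER
 * (the user-space name of SCHED_NORMAL) reports 0..0. */
void print_priority_ranges(void)
{
    const int policies[] = { SCHED_OTHER, SCHED_FIFO, SCHED_RR };
    const char *names[] = { "SCHED_OTHER", "SCHED_FIFO", "SCHED_RR" };

    for (int i = 0; i < 3; i++)
        printf("%s: %d..%d\n", names[i],
               sched_get_priority_min(policies[i]),
               sched_get_priority_max(policies[i]));
}
```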
As mentioned in the first chapter, Linux processes are
preemptable. When a process enters the TASK_RUNNING state, the kernel checks
whether its dynamic priority is greater than the priority of the
currently running process. If it is, the execution of current is interrupted and the scheduler is
invoked to select another process to run (usually the process that
just became runnable). Of course, a process also may be preempted when
its time quantum expires. When this occurs, the TIF_NEED_RESCHED flag in the thread_info structure of the current process
is set, so the scheduler is invoked when the timer interrupt handler
terminates.
For instance, let's consider a scenario in which only two
programs—a text editor and a compiler—are being executed. The text
editor is an interactive program, so it has a higher dynamic priority
than the compiler. Nevertheless, it is often suspended, because the
user alternates between pauses for think time and data entry;
moreover, the average delay between two keypresses is relatively long.
However, as soon as the user presses a key, an interrupt is raised and
the kernel wakes up the text editor process. The kernel also
determines that the dynamic priority of the editor is higher than the
priority of current, the currently
running process (the compiler), so it sets the TIF_NEED_RESCHED flag of this process, thus
forcing the scheduler to be activated when the kernel finishes
handling the interrupt. The scheduler selects the editor and performs
a process switch; as a result, the execution of the editor is resumed
very quickly and the character typed by the user is echoed to the
screen. When the character has been processed, the text editor process
suspends itself waiting for another keypress and the compiler process
can resume its execution.
Be aware that a preempted process is not suspended, because it
remains in the TASK_RUNNING state;
it simply no longer uses the CPU. Moreover, remember that the Linux
2.6 kernel is preemptive, which means that a process can be preempted
either when executing in Kernel or in User Mode; we discussed in depth
this feature in the section "Kernel Preemption" in
Chapter 5.
The quantum duration is critical for system performance: it should be neither too long nor too short.
If the average quantum duration is too short, the system overhead caused by process switches becomes excessively high. For instance, suppose that a process switch requires 5 milliseconds; if the quantum is also set to 5 milliseconds, then at least 50 percent of the CPU cycles will be dedicated to process switching.[*]
If the average quantum duration is too long, processes no longer appear to be executed concurrently. For instance, let's suppose that the quantum is set to five seconds; each runnable process makes progress for about five seconds, but then it stops for a very long time (typically, five seconds times the number of runnable processes).
It is often believed that a long quantum duration degrades the response time of interactive applications. This is usually false. As described in the section "Process Preemption" earlier in this chapter, interactive processes have a relatively high priority, so they quickly preempt the batch processes, no matter how long the quantum duration is.
In some cases, however, a very long quantum duration degrades the responsiveness of the system. For instance, suppose two users concurrently enter two commands at the respective shell prompts; one command starts a CPU-bound process, while the other launches an interactive application. Both shells fork a new process and delegate the execution of the user's command to it; moreover, suppose such new processes have the same initial priority (Linux does not know in advance if a program to be executed is batch or interactive). Now if the scheduler selects the CPU-bound process to run first, the other process could wait for a whole time quantum before starting its execution. Therefore, if the quantum duration is long, the system could appear to be unresponsive to the user that launched the interactive application.
The choice of the average quantum duration is always a compromise. The rule of thumb adopted by Linux is to choose a duration as long as possible, while keeping good system response time.
[*] Recall that stopped and suspended processes cannot be selected by the scheduling algorithm to run on a CPU.
[*] Actually, things could be much worse than this; for example, if the time required for the process switch is counted in the process quantum, all CPU time is devoted to the process switch and no process can progress toward its termination.
The scheduling algorithm used in earlier versions of Linux was quite simple and straightforward: at every process switch the kernel scanned the list of runnable processes, computed their priorities, and selected the "best" process to run. The main drawback of that algorithm is that the time spent in choosing the best process depends on the number of runnable processes; therefore, the algorithm is too costly—that is, it spends too much time—in high-end systems running thousands of processes.
The scheduling algorithm of Linux 2.6 is much more sophisticated. By design, it scales well with the number of runnable processes, because it selects the process to run in constant time, independently of the number of runnable processes. It also scales well with the number of processors because each CPU has its own queue of runnable processes. Furthermore, the new algorithm does a better job of distinguishing interactive processes and batch processes. As a consequence, users of heavily loaded systems feel that interactive applications are much more responsive in Linux 2.6 than in earlier versions.
The scheduler always succeeds in finding a process to be executed; in fact, there is always at least one runnable process: the swapper process, which has PID 0 and executes only when the CPU cannot execute other processes. As mentioned in Chapter 3, every CPU of a multiprocessor system has its own swapper process with PID equal to 0.
Every Linux process is always scheduled according to one of the following scheduling classes:
SCHED_FIFO
A First-In, First-Out real-time process. When the scheduler assigns the CPU to the process, it leaves the process descriptor in its current position in the runqueue list. If no other higher-priority real-time process is runnable, the process continues to use the CPU as long as it wishes, even if other real-time processes that have the same priority are runnable.
SCHED_RR
A Round Robin real-time process. When the scheduler assigns
the CPU to the process, it puts the process descriptor at the end
of the runqueue list. This policy ensures a fair assignment of CPU
time to all SCHED_RR real-time
processes that have the same priority.
SCHED_NORMAL
A conventional, time-shared process.
The scheduling algorithm behaves quite differently depending on whether the process is conventional or real-time.
Every conventional process has its own static priority, which is a value used by the scheduler to rate the process with respect to the other conventional processes in the system. The kernel represents the static priority of a conventional process with a number ranging from 100 (highest priority) to 139 (lowest priority); notice that static priority decreases as the values increase.
A new process always inherits the static priority of its parent.
However, a user can change the static priority of the processes that
he owns by passing some "nice values" to the nice( ) and setpriority( ) system calls (see the section
"System Calls Related to
Scheduling" later in this chapter).
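A minimal user-space sketch of this interface follows (the helper name `add_to_nice` and the error value -999 are inventions for this example; an unprivileged process may only increase its nice value):

```c
#include <sys/resource.h>
#include <errno.h>

/* Raise this process's nice value (i.e., lower its static priority) by
 * `delta`. Returns the new nice value, or -999 on error. Because -1 is a
 * legal nice value, errno must be cleared before getpriority(). */
int add_to_nice(int delta)
{
    errno = 0;
    int cur = getpriority(PRIO_PROCESS, 0);   /* 0 means "calling process" */
    if (errno != 0)
        return -999;
    if (setpriority(PRIO_PROCESS, 0, cur + delta) != 0)
        return -999;
    return getpriority(PRIO_PROCESS, 0);
}
```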
The static priority essentially determines the base time quantum of a process, that is, the time quantum duration assigned to the process when it has exhausted its previous time quantum. Static priority and base time quantum are related by the following formula:

base time quantum (in ms) = (140 − static priority) × 20 if static priority < 120, or (140 − static priority) × 5 if static priority ≥ 120 (1)
As you see, the higher the static priority (i.e., the lower its numerical value), the longer the base time quantum. As a consequence, higher priority processes usually get longer slices of CPU time with respect to lower priority processes. Table 7-2 shows the static priority, the base time quantum values, and the corresponding nice values for a conventional process having highest static priority, default static priority, and lowest static priority. (The table also lists the values of the interactive delta and of the sleep time threshold, which are explained later in this chapter.)
Table 7-2. Typical priority values for a conventional process

| Description | Static priority | Nice value | Base time quantum | Interactive delta | Sleep time threshold |
|---|---|---|---|---|---|
| Highest static priority | 100 | -20 | 800 ms | -3 | 299 ms |
| High static priority | 110 | -10 | 600 ms | -1 | 499 ms |
| Default static priority | 120 | 0 | 100 ms | +2 | 799 ms |
| Low static priority | 130 | +10 | 50 ms | +4 | 999 ms |
| Lowest static priority | 139 | +19 | 5 ms | +6 | 1199 ms |
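The mapping between static priority and base time quantum can be sketched as a small helper. The piecewise rule below (20 ms steps above the default priority, 5 ms steps from the default downward) reproduces the values in Table 7-2; the function name is an invention for this illustration:

```c
/* Base time quantum (in ms) as a function of static priority (100..139):
 * (140 - sp) * 20 ms for priorities higher than the default (sp < 120),
 * (140 - sp) * 5 ms otherwise. */
int base_time_quantum_ms(int static_priority)
{
    if (static_priority < 120)
        return (140 - static_priority) * 20;
    return (140 - static_priority) * 5;
}
```

The assertions in Table 7-2 hold: priority 100 yields 800 ms, 120 yields 100 ms, and 139 yields 5 ms.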
Besides a static priority, a conventional process also has a dynamic priority, which is a value ranging from 100 (highest priority) to 139 (lowest priority). The dynamic priority is the number actually looked up by the scheduler when selecting the new process to run. It is related to the static priority by the following empirical formula:
dynamic priority = max (100, min ( static priority − bonus + 5, 139)) (2)
The bonus is a value ranging from 0 to 10; a value less than 5 represents a penalty that lowers the dynamic priority, while a value greater than 5 is a premium that raises the dynamic priority. The value of the bonus, in turn, depends on the past history of the process; more precisely, it is related to the average sleep time of the process.
Roughly, the average sleep time is the average number of
nanoseconds that the process spent while sleeping. Be warned,
however, that this is not an average operation on the elapsed time.
For instance, sleeping in TASK_INTERRUPTIBLE state contributes to
the average sleep time in a different way from sleeping in TASK_UNINTERRUPTIBLE state. Moreover, the
average sleep time decreases while a process is running. Finally,
the average sleep time can never become larger than 1 second.
The correspondence between average sleep times and bonus values is shown in Table 7-3. (The table lists also the corresponding granularity of the time slice, which will be discussed later.)
Table 7-3. Average sleep times, bonus values, and time slice granularity

| Average sleep time | Bonus | Granularity |
|---|---|---|
| Greater than or equal to 0 but smaller than 100 ms | 0 | 5120 |
| Greater than or equal to 100 ms but smaller than 200 ms | 1 | 2560 |
| Greater than or equal to 200 ms but smaller than 300 ms | 2 | 1280 |
| Greater than or equal to 300 ms but smaller than 400 ms | 3 | 640 |
| Greater than or equal to 400 ms but smaller than 500 ms | 4 | 320 |
| Greater than or equal to 500 ms but smaller than 600 ms | 5 | 160 |
| Greater than or equal to 600 ms but smaller than 700 ms | 6 | 80 |
| Greater than or equal to 700 ms but smaller than 800 ms | 7 | 40 |
| Greater than or equal to 800 ms but smaller than 900 ms | 8 | 20 |
| Greater than or equal to 900 ms but smaller than 1000 ms | 9 | 10 |
| 1 second | 10 | 10 |
The average sleep time is also used by the scheduler to determine whether a given process should be considered interactive or batch. More precisely, a process is considered "interactive" if it satisfies the following formula:
dynamic priority ≤ 3 × static priority / 4 + 28 (3)
which is equivalent to the following:
bonus − 5 ≥ static priority / 4 − 28
The expression static priority / 4 − 28 is called the interactive delta ; some typical values of this term are listed in Table 7-2. It should be noted that it is far easier for high priority than for low priority processes to become interactive. For instance, a process having highest static priority (100) is considered interactive when its bonus value exceeds 2, that is, when its average sleep time exceeds 200 ms. Conversely, a process having lowest static priority (139) is never considered as interactive, because the bonus value is always smaller than the value 11 required to reach an interactive delta equal to 6. A process having default static priority (120) becomes interactive as soon as its average sleep time exceeds 700 ms.
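Formulas (2) and (3) can be sketched directly in C, using integer arithmetic as the kernel does (the function names are inventions for this illustration):

```c
/* Dynamic priority, formula (2): clamped to the 100..139 range. */
int dynamic_priority(int static_priority, int bonus)
{
    int p = static_priority - bonus + 5;
    if (p < 100) p = 100;
    if (p > 139) p = 139;
    return p;
}

/* Interactivity test, formula (3):
 * dynamic priority <= 3 * static priority / 4 + 28. */
int is_interactive(int static_priority, int bonus)
{
    return dynamic_priority(static_priority, bonus)
           <= 3 * static_priority / 4 + 28;
}
```

For example, a default-priority process (120) becomes interactive once its bonus reaches 7, while a lowest-priority process (139) is never interactive even with the maximum bonus of 10, matching the discussion above.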
Even if conventional processes having higher static priorities get larger slices of the CPU time, they should not completely lock out the processes having lower static priority. To avoid process starvation, when a process finishes its time quantum, it can be replaced by a lower priority process whose time quantum has not yet been exhausted. To implement this mechanism, the scheduler keeps two disjoint sets of runnable processes:

Active processes

These runnable processes have not yet exhausted their time quantum and are thus allowed to run.

Expired processes

These runnable processes have exhausted their time quantum and are thus forbidden to run until all active processes expire.
However, the general schema is slightly more complicated than this, because the scheduler tries to boost the performance of interactive processes. An active batch process that finishes its time quantum always becomes expired. An active interactive process that finishes its time quantum usually remains active: the scheduler refills its time quantum and leaves it in the set of active processes. However, the scheduler moves an interactive process that finished its time quantum into the set of expired processes if the eldest expired process has already waited for a long time, or if an expired process has higher static priority (lower value) than the interactive process. As a consequence, the set of active processes will eventually become empty and the expired processes will have a chance to run.
Every real-time process is associated with a real-time
priority, which is a value ranging from 1 (highest
priority) to 99 (lowest priority). The scheduler always favors a
higher priority runnable process over a lower priority one; in other
words, a real-time process inhibits the execution of every
lower-priority process while it remains runnable. Contrary to
conventional processes, real-time processes are always considered
active (see the previous section). The user can change the real-time
priority of a process by means of the sched_setparam( ) and sched_setscheduler(
) system calls (see the section "System Calls Related to
Scheduling" later in this chapter).
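A hedged user-space sketch of switching to a real-time policy follows (the helper name `try_make_realtime` is an invention; sched_setscheduler( ) fails with EPERM for unprivileged callers, which the sketch tolerates by reporting the policy still in effect):

```c
#include <sched.h>
#include <errno.h>

/* Try to switch the calling process to a real-time policy; fall back
 * gracefully when not privileged. Returns the policy actually in effect
 * afterwards, or -1 on an unexpected failure. */
int try_make_realtime(int policy, int rt_priority)
{
    struct sched_param param = { .sched_priority = rt_priority };

    if (sched_setscheduler(0, policy, &param) != 0 && errno != EPERM)
        return -1;                    /* unexpected failure */
    return sched_getscheduler(0);     /* on EPERM, the old policy remains */
}
```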
If several real-time runnable processes have the same highest priority, the scheduler chooses the process that occurs first in the corresponding list of the local CPU's runqueue (see the section "The lists of TASK_RUNNING processes" in Chapter 3).
A real-time process is replaced by another process only when one of the following events occurs:
The process is preempted by another process having higher real-time priority.
The process performs a blocking operation, and it is put to
sleep (in state TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE).
The process is stopped (in state TASK_STOPPED or TASK_TRACED), or it is killed (in state
EXIT_ZOMBIE or EXIT_DEAD).
The process voluntarily relinquishes the CPU by invoking the
sched_yield( ) system call (see the section "System Calls Related to
Scheduling" later in this chapter).
The process is Round Robin real-time (SCHED_RR), and it has exhausted its time
quantum.
The nice( ) and setpriority( )
system calls, when applied to a Round Robin real-time
process, do not change the real-time priority but rather the duration
of the base time quantum. In fact, the duration of the base time
quantum of Round Robin real-time processes does not depend on the
real-time priority, but rather on the static priority of the process,
according to the formula (1) in the earlier section "Scheduling of Conventional
Processes."
Recall from the section "Identifying a Process" in
Chapter 3 that the process
list links all process descriptors, while the runqueue lists link the
process descriptors of all runnable processes—that is, of those in a
TASK_RUNNING state—except the
swapper process (idle process).
The runqueue data structure
is the most important data structure of the Linux 2.6 scheduler. Each
CPU in the system has its own runqueue; all runqueue structures are stored in the
runqueues per-CPU variable (see the
section "Per-CPU
Variables" in Chapter
5). The this_rq( ) macro
yields the address of the runqueue of the local CPU, while the
cpu_rq(n) macro yields the address
of the runqueue of the CPU having index n.
Table 7-4
lists the fields included in the runqueue data structure; we will discuss
most of them in the following sections of the chapter.
Table 7-4. The fields of the runqueue structure

| Type | Name | Description |
|---|---|---|
| spinlock_t | lock | Spin lock protecting the lists of processes |
| unsigned long | nr_running | Number of runnable processes in the runqueue lists |
| unsigned long | cpu_load | CPU load factor based on the average number of processes in the runqueue |
| unsigned long | nr_switches | Number of process switches performed by the CPU |
| unsigned long | nr_uninterruptible | Number of processes that were previously in the runqueue lists and are now sleeping in TASK_UNINTERRUPTIBLE state |
| unsigned long | expired_timestamp | Insertion time of the eldest process in the expired lists |
| unsigned long long | timestamp_last_tick | Timestamp value of the last timer interrupt |
| task_t * | curr | Process descriptor pointer of the currently running process (same as current for the local CPU) |
| task_t * | idle | Process descriptor pointer of the swapper process for this CPU |
| struct mm_struct * | prev_mm | Used during a process switch to store the address of the memory descriptor of the process being replaced |
| prio_array_t * | active | Pointer to the lists of active processes |
| prio_array_t * | expired | Pointer to the lists of expired processes |
| prio_array_t [2] | arrays | The two sets of active and expired processes |
| int | best_expired_prio | The best static priority (lowest value) among the expired processes |
| atomic_t | nr_iowait | Number of processes that were previously in the runqueue lists and are now waiting for a disk I/O operation to complete |
| struct sched_domain * | sd | Points to the base scheduling domain of this CPU (see the section "Scheduling Domains" later in this chapter) |
| int | active_balance | Flag set if some process shall be migrated from this runqueue to another (runqueue balancing) |
| int | push_cpu | Not used |
| task_t * | migration_thread | Process descriptor pointer of the migration kernel thread |
| struct list_head | migration_queue | List of processes to be removed from the runqueue |
The most important fields of the runqueue data structure are those related to
the lists of runnable processes. Every runnable process in the system
belongs to one, and just one, runqueue. As long as a runnable process
remains in the same runqueue, it can be executed only by the CPU
owning that runqueue. However, as we'll see later, runnable processes
may migrate from one runqueue to another.
The arrays field of the
runqueue is an array consisting of two prio_array_t structures. Each data structure
represents a set of runnable processes, and includes 140 doubly linked
list heads (one list for each possible process priority), a priority
bitmap, and a counter of the processes included in the set (see Table 3-2 in Chapter 3).
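The constant-time selection enabled by the priority bitmap can be sketched as follows. This is a toy model, not the kernel code (the real scheduler uses sched_find_first_bit( ) on the 140-bit bitmap); all names here are inventions for the illustration:

```c
#include <limits.h>

#define MAX_PRIO 140
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Toy priority bitmap: one bit per priority 0..139, as in prio_array_t. */
typedef struct {
    unsigned long bitmap[BITMAP_WORDS];
} prio_bitmap;

void set_prio(prio_bitmap *b, int prio)
{
    b->bitmap[prio / BITS_PER_LONG] |= 1UL << (prio % BITS_PER_LONG);
}

/* Find the best priority (lowest index) with a runnable process. Runs in
 * constant time: at most BITMAP_WORDS word scans, independent of how many
 * processes are runnable. Returns MAX_PRIO if the bitmap is empty. */
int find_first_prio(const prio_bitmap *b)
{
    for (unsigned i = 0; i < BITMAP_WORDS; i++)
        if (b->bitmap[i])
            return (int)(i * BITS_PER_LONG) + __builtin_ctzl(b->bitmap[i]);
    return MAX_PRIO;
}

/* Demo: with runnable processes at priorities 120 and 101, the lookup
 * returns 101, the numerically lower (better) priority. */
int demo_find(void)
{
    prio_bitmap b = { { 0 } };
    set_prio(&b, 120);
    set_prio(&b, 101);
    return find_first_prio(&b);
}
```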
As shown in Figure
7-1, the active field of the
runqueue structure points to one of
the two prio_array_t data
structures in arrays: the
corresponding set of runnable processes includes the active processes.
Conversely, the expired field
points to the other prio_array_t
data structure in arrays: the
corresponding set of runnable processes includes the expired
processes.
Periodically, the role of the two data structures in arrays changes: the active processes
suddenly become the expired processes, and the expired processes
become the active ones. To achieve this change, the scheduler simply
exchanges the contents of the active and expired fields of the runqueue.
Each process descriptor includes several fields related to scheduling; they are listed in Table 7-5.
Table 7-5. Fields of the process descriptor related to the scheduler
Type | Name | Description |
|---|---|---|
unsigned long | thread_info->flags | Stores the |
unsigned int | thread_info->cpu | Logical number of the CPU owning the runqueue to which the runnable process belongs |
unsigned long | state | The current state of the process (see the section "Process State" in Chapter 3) |
| | Dynamic priority of the process |
| | Static priority of the process |
| | Pointers to the next and previous elements in the runqueue list to which the process belongs |
| | Pointer to the runqueue's |
| | Average sleep time of the process |
| | Time of last insertion of the process in the runqueue, or time of last process switch involving the process |
| | Time of last process switch that replaced the process |
| | Condition code used when the process is awakened |
| | The scheduling class of the process ( |
| | Bit mask of the CPUs that can execute the process |
| | Ticks left in the time quantum of the process |
| | Flag set to 1 if the process never exhausted its time quantum |
| | Real-time priority of the process |
When a new process is created, sched_fork( ), invoked by copy_process( ), sets the time_slice field of both current (the parent) and p (the child) processes in the following
way:
p->time_slice = (current->time_slice + 1) >> 1;
current->time_slice >>= 1;
In other words, the number of ticks left to the parent is split in two halves: one for the parent and one for the child. This is done to prevent users from getting an unlimited amount of CPU time by using the following method: the parent process creates a child process that runs the same code and then kills itself; by properly adjusting the creation rate, the child process would always get a fresh quantum before the quantum of its parent expires. This programming trick does not work because the kernel does not reward forks. Similarly, a user cannot hog an unfair share of the processor by starting several background processes in a shell or by opening a lot of windows on a graphical desktop. More generally speaking, a process cannot hog resources (unless it has privileges to give itself a real-time policy) by forking multiple descendents.
If the parent had just one tick left in its time slice, the
splitting operation forces current->time_slice to 0, thus exhausting
the quantum of the parent. In this case, copy_process( ) sets current->time_slice back to 1, then
invokes scheduler_tick( ) to
decrease the field (see the following section).
The copy_process( ) function
also initializes a few other fields of the child's process descriptor
related to scheduling:
p->first_time_slice = 1;
p->timestamp = sched_clock( );
The first_time_slice flag is
set to 1, because the child has never exhausted its time quantum (if a
process terminates or executes a new program during its first time
slice, the parent process is rewarded with the remaining time slice of
the child). The timestamp field is
initialized with a timestamp value produced by sched_clock( ): essentially, this function
returns the contents of the 64-bit TSC register (see the section
"Time Stamp Counter
(TSC)" in Chapter 6)
converted to nanoseconds.
The scheduler relies on several functions in order to do its work; the most important are:
scheduler_tick( )
Keeps the time_slice counter of current up-to-date
try_to_wake_up( )
Awakens a sleeping process
recalc_task_prio( )
Updates the dynamic priority of a process
schedule( )
Selects a new process to be executed
load_balance( )
Keeps the runqueues of a multiprocessor system balanced
We have already explained in the section "Updating Local CPU
Statistics" in Chapter
6 how scheduler_tick( ) is
invoked once every tick to perform some operations related to
scheduling. It executes the following main steps:
Stores in the timestamp_last_tick field of the local
runqueue the current value of the TSC converted to nanoseconds;
this timestamp is obtained from the sched_clock( ) function (see the
previous section).
Checks whether the current process is the swapper process of the local CPU. If so, it performs the following substeps:
If the local runqueue includes another runnable process
besides swapper, it sets the TIF_NEED_RESCHED flag of the current
process to force rescheduling. As we'll see in the section
"The schedule( )
Function" later in this chapter, if the kernel supports
the hyper-threading technology (see the section "Runqueue Balancing in
Multiprocessor Systems" later in this chapter), a
logical CPU might be idle even if there are runnable processes
in its runqueue, as long as those processes have significantly
lower priorities than the priority of a process already
executing on another logical CPU associated with the same
physical CPU.
Jumps to step 7 (there is no need to update the time slice counter of the swapper process).
Checks whether current->array points to the active
list of the local runqueue. If not, the process has expired its
time quantum, but it has not yet been replaced: sets the TIF_NEED_RESCHED flag to force
rescheduling, and jumps to step 7.
Acquires the this_rq()->lock spin lock.
Decreases the time slice counter of the current process, and checks whether the quantum is exhausted. The operations performed by the function are quite different according to the scheduling class of the process; we will discuss them in a moment.
Releases the this_rq( )->lock spin lock.
Invokes the rebalance_tick(
) function, which should ensure that the runqueues of
the various CPUs contain approximately the same number of runnable
processes. We will discuss runqueue balancing in the later section
"Runqueue Balancing in
Multiprocessor Systems."
If the current process is a FIFO real-time process,
scheduler_tick( ) has nothing to
do. In this case, in fact, current cannot be preempted by lower or
equal priority processes, thus it does not make sense to keep its
time slice counter up-to-date.
If current is a Round Robin
real-time process, scheduler_tick(
) decreases its time slice counter and checks whether the
quantum is exhausted:
if (current->policy == SCHED_RR && !--current->time_slice) {
    current->time_slice = task_timeslice(current);
    current->first_time_slice = 0;
    set_tsk_need_resched(current);
    list_del(&current->run_list);
    list_add_tail(&current->run_list,
                  this_rq( )->active->queue+current->prio);
}
If the function determines that the time quantum is
effectively exhausted, it performs a few operations aimed to ensure
that current will be preempted,
if necessary, as soon as possible.
The first operation consists of refilling the time slice
counter of the process by invoking task_timeslice( ). This function considers
the static priority of the process and returns the corresponding
base time quantum, according to the formula (1) shown in the earlier
section "Scheduling of
Conventional Processes." Moreover, the first_time_slice field of current is cleared: this flag is set by
copy_process( ) in the service
routine of the fork( ) system call, and should be cleared as soon as the
first time quantum of the process elapses.
Next, scheduler_tick( )
invokes the set_tsk_need_resched(
) function to set the TIF_NEED_RESCHED flag of the process. As
described in the section "Returning from Interrupts and
Exceptions" in Chapter
4, this flag forces the invocation of the schedule( ) function, so that current can be replaced by another
real-time process having equal (or higher) priority, if any.
The last operation of scheduler_tick(
) consists of moving the process descriptor to the last
position of the runqueue active list corresponding to the priority
of current. Placing current in the last position ensures that
it will not be selected again for execution until every real-time
runnable process having the same priority as current will get a slice of the CPU time.
This is the meaning of round-robin scheduling. The descriptor is
moved by first invoking list_del(
) to remove the process from the runqueue active list,
then by invoking list_add_tail( )
to insert back the process in the last position of the same
list.
If the current process is a conventional process, the
scheduler_tick( ) function
performs the following operations:
Decreases the time slice counter (current->time_slice).
Checks the time slice counter. If the time quantum is exhausted, the function performs the following operations:
Invokes dequeue_task(
) to remove current from the this_rq( )->active set of
runnable processes.
Invokes set_tsk_need_resched(
) to set the TIF_NEED_RESCHED flag.
Updates the dynamic priority of current:
current->prio = effective_prio(current);
The effective_prio(
) function reads the static_prio and sleep_avg fields of current, and computes the dynamic
priority of the process according to the formula (2) shown
in the earlier section "Scheduling of
Conventional Processes."
Refills the time quantum of the process:
current->time_slice = task_timeslice(current);
current->first_time_slice = 0;
If the expired_timestamp field of the local runqueue data structure is equal to zero (that is, the set of expired processes is empty), writes into the field the value of the current tick:
if (!this_rq( )->expired_timestamp)
    this_rq( )->expired_timestamp = jiffies;
Inserts the current process either in the active set or in the expired set:
if (!TASK_INTERACTIVE(current) || EXPIRED_STARVING(this_rq( ))) {
    enqueue_task(current, this_rq( )->expired);
    if (current->static_prio < this_rq( )->best_expired_prio)
        this_rq( )->best_expired_prio = current->static_prio;
} else
    enqueue_task(current, this_rq( )->active);
The TASK_INTERACTIVE macro yields the value one if the process is recognized as interactive using the formula (3) shown in the earlier section "Scheduling of Conventional Processes." The EXPIRED_STARVING macro checks whether the first expired process in the runqueue had to wait for more than 1000 ticks times the number of runnable processes in the runqueue plus one; if so, the macro yields the value one. The EXPIRED_STARVING macro also yields the value one if the static priority value of the current process is greater than the static priority value of an already expired process.
Otherwise, if the time quantum is not exhausted (current->time_slice is not zero), checks whether the remaining time slice of the current process is too long:
if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
      p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
      (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
      (p->array == rq->active)) {
    list_del(&current->run_list);
    list_add_tail(&current->run_list,
                  this_rq( )->active->queue+current->prio);
    set_tsk_need_resched(p);
}
The TIMESLICE_GRANULARITY macro yields the product of the number of CPUs in the system and a constant proportional to the bonus of the current process (see Table 7-3 earlier in the chapter). Basically, the time quantum of interactive processes with high static priorities is split into several pieces of TIMESLICE_GRANULARITY size, so that they do not monopolize the CPU.
The try_to_wake_up( )
function awakes a sleeping or stopped process by setting its state to
TASK_RUNNING and inserting it into
the runqueue of the local CPU. For instance, the function is invoked
to wake up processes included in a wait queue (see the section "How Processes Are
Organized" in Chapter
3) or to resume execution of processes waiting for a signal
(see Chapter 11). The
function receives as its parameters:
The descriptor pointer (p) of the process to be awakened
A mask of the process states (state) that can be awakened
A flag (sync) that
forbids the awakened process to preempt the process currently
running on the local CPU
The function performs the following operations:
Invokes the task_rq_lock(
) function to disable local interrupts and to acquire
the lock of the runqueue rq
owned by the CPU that was last executing the process (it could be
different from the local CPU). The logical number of that CPU is
stored in the p->thread_info->cpu field.
Checks if the state of the process p->state belongs to the mask of
states state passed as argument
to the function; if this is not the case, it jumps to step 9 to
terminate the function.
If the p->array field
is not NULL, the process
already belongs to a runqueue; therefore, it jumps to step
8.
In multiprocessor systems, it checks whether the process to be awakened should be migrated from the runqueue of the lastly executing CPU to the runqueue of another CPU. Essentially, the function selects a target runqueue according to some heuristic rules. For example:
If some CPU in the system is idle, it selects its runqueue as the target. Preference is given to the previously executing CPU and to the local CPU, in this order.
If the workload of the previously executing CPU is significantly lower than the workload of the local CPU, it selects the old runqueue as the target.
If the process has been executed recently, it selects the old runqueue as the target (the hardware cache might still be filled with the data of the process).
If moving the process to the local CPU reduces the unbalance between the CPUs, the target is the local runqueue (see the section "Runqueue Balancing in Multiprocessor Systems" later in this chapter).
After this step has been executed, the function has
identified a target CPU that will execute the awakened process
and, correspondingly, a target runqueue rq in which to insert the
process.
If the process is in the TASK_UNINTERRUPTIBLE state, it decreases
the nr_uninterruptible field of
the target runqueue, and sets the p->activated field of the process
descriptor to -1. See the later
section "The
recalc_task_prio( ) Function" for an explanation of the
activated field.
Invokes the activate_task(
) function, which in turn performs the following
substeps:
Invokes sched_clock(
) to get the current timestamp in nanoseconds. If
the target CPU is not the local CPU, it compensates for the
drift of the local timer interrupts by using the timestamps
relative to the last occurrences of the timer interrupts on
the local and target CPUs:
now = (sched_clock( ) - this_rq( )->timestamp_last_tick)
+ rq->timestamp_last_tick;Invokes recalc_task_prio(
), passing to it the process descriptor pointer and
the timestamp computed in the previous step. The recalc_task_prio( ) function is
described in the next section.
Sets the value of the p->activated field according to
Table 7-6
later in this chapter.
Sets the p->timestamp field with the
timestamp computed in step 6a.
Inserts the process descriptor in the active set:
enqueue_task(p, rq->active); rq->nr_running++;
If either the target CPU is not the local CPU or if the
sync flag is not set, it checks
whether the new runnable process has a dynamic priority higher
than that of the current process of the rq runqueue (p->prio < rq->curr->prio);
if so, invokes resched_task( )
to preempt rq->curr. In
uniprocessor systems the latter function just executes set_tsk_need_resched( ) to set the
TIF_NEED_RESCHED flag of the
rq->curr process. In
multiprocessor systems resched_task( ) also checks whether the old value of the TIF_NEED_RESCHED flag was zero, whether the target CPU is different from the local CPU, and whether the TIF_POLLING_NRFLAG flag of the rq->curr process is clear (the target CPU is not actively polling the status of the TIF_NEED_RESCHED flag of the process).
If so, resched_task( ) invokes
smp_send_reschedule( ) to raise
an IPI and force rescheduling on the target CPU (see the section
"Interprocessor
Interrupt Handling" in Chapter 4).
Sets the p->state
field of the process to TASK_RUNNING.
Invokes task_rq_unlock( )
to unlock the rq runqueue and
reenable the local interrupts.
Returns 1 (if the process has been successfully awakened) or 0 (if the process has not been awakened).
The recalc_task_prio(
) function updates the average sleep time and the dynamic
priority of a process. It receives as its parameters a process
descriptor pointer p and a
timestamp now computed by the
sched_clock( ) function.
The function executes the following operations:
Stores in the sleep_time
local variable the result of:
min(now − p->timestamp, 10^9)
The p->timestamp field
contains the timestamp of the process switch that put the process
to sleep; therefore, sleep_time
stores the number of nanoseconds that the process spent sleeping
since its last execution (or the equivalent of 1 second, if the
process slept more).
If sleep_time is not
greater than zero, it jumps to step 8 so as to skip updating the
average sleep time of the process.
Checks whether the process is not a
kernel thread, whether it is awakening from the TASK_UNINTERRUPTIBLE state (p->activated field equal to −1; see
step 5 in the previous section), and whether it has been
continuously asleep beyond a given sleep time threshold. If these
three conditions are fulfilled, the function sets the p->sleep_avg field to the equivalent
of 900 ticks (an empirical value obtained by subtracting the
duration of the base time quantum of a standard process from the
maximum average sleep time). Then, it jumps to step
8.
The sleep time threshold depends on the static priority of the process; some typical values are shown in Table 7-2. In short, the goal of this empirical rule is to ensure that processes that have been asleep for a long time in uninterruptible mode—usually waiting for disk I/O operations—get a predefined sleep average value that is large enough to allow them to be quickly serviced, but it is also not so large to cause starvation for other processes.
Executes the CURRENT_BONUS macro to compute the
bonus value of the previous average sleep time of the
process (see Table
7-3). If (10 -
bonus) is greater than zero, the function
multiplies sleep_time by this
value. Since sleep_time will be
added to the average sleep time of the process (see step 6 below),
the lower the current average sleep time is, the more rapidly it
will rise.
If the process is in TASK_UNINTERRUPTIBLE mode and it is not
a kernel thread, it performs the following substeps:
Checks whether the average sleep time p->sleep_avg is greater than or
equal to its sleep time threshold (see Table 7-2 earlier
in this chapter). If so, it resets the sleep_avg local variable to
zero—thus skipping the adjustment of the average sleep
time—and jumps to step 6.
If the sum sleep_avg +
p->sleep_avg is greater than or equal to the
sleep time threshold, it sets the p->sleep_avg field to the sleep
time threshold, and sets sleep_avg to zero.
By somewhat limiting the increment of the average sleep time of the process, the function does not reward too much batch processes that sleep for a long time.
Adds sleep_time to the
average sleep time of the process (p->sleep_avg).
Checks whether p->sleep_avg exceeds 1000 ticks (in
nanoseconds); if so, the function cuts it down to 1000 ticks (in
nanoseconds).
Updates the dynamic priority of the process:
p->prio = effective_prio(p);
The effective_prio( )
function has already been discussed in the section "The scheduler_tick( )
Function" earlier in this chapter.
The schedule( )
function implements the scheduler. Its objective is to find a process
in the runqueue list and then assign the CPU to it. It is invoked,
directly or in a lazy (deferred) way, by several kernel
routines.
The scheduler is invoked directly when the current process must be blocked right away
because the resource it needs is not available. In this case, the
kernel routine that wants to block it proceeds as follows:
Inserts current in the
proper wait queue.
Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE.
Invokes schedule( ).
Checks whether the resource is available; if not, goes to step 2.
Once the resource is available, removes current from the wait queue.
The kernel routine checks repeatedly whether the resource
needed by the process is available; if not, it yields the CPU to
some other process by invoking schedule(
). Later, when the scheduler once again grants the CPU to
the process, the availability of the resource is rechecked. These
steps are similar to those performed by wait_event( ) and similar functions
described in the section "How Processes Are
Organized" in Chapter
3.
The scheduler is also directly invoked by many device drivers
that execute long iterative tasks. At each iteration cycle, the
driver checks the value of the TIF_NEED_RESCHED flag and, if necessary,
invokes schedule( ) to
voluntarily relinquish the CPU.
The scheduler can also be invoked in a lazy way by setting the
TIF_NEED_RESCHED flag of current to 1. Because a check on the value
of this flag is always made before resuming the execution of a User
Mode process (see the section "Returning from Interrupts and
Exceptions" in Chapter
4), schedule( ) will
definitely be invoked at some time in the near future.
Typical examples of lazy invocation of the scheduler are:
When current has used
up its quantum of CPU time; this is done by the scheduler_tick( ) function.
When a process is woken up and its priority is higher than
that of the current process; this task is performed by the
try_to_wake_up( )
function.
When a sched_setscheduler(
) system call is issued (see the section "System Calls Related to
Scheduling" later in this chapter).
The goal of the schedule(
) function consists of replacing the currently executing
process with another one. Thus, the key outcome of the function is
to set a local variable called next, so that it points to the descriptor
of the process selected to replace current. If no runnable process in the
system has priority greater than the priority of current, at the end, next coincides with current and no process switch takes
place.
The schedule( ) function
starts by disabling kernel preemption and initializing a few local
variables:
need_resched:
preempt_disable( );
prev = current;
rq = this_rq( );
As you see, the pointer returned by current is saved in prev, and the address of the runqueue data
structure corresponding to the local CPU is saved in rq.
Next, schedule( ) makes
sure that prev doesn't hold the
big kernel lock (see the section "The Big Kernel Lock" in
Chapter 5):
if (prev->lock_depth >= 0)
    up(&kernel_sem);
Notice that schedule( )
doesn't change the value of the lock_depth field; when prev resumes its execution, it reacquires
the kernel_flag spin lock if the
value of this field is not negative. Thus, the big kernel lock is
automatically released and reacquired across a process
switch.
The sched_clock( ) function
is invoked to read the TSC and convert its value to nanoseconds; the
timestamp obtained is saved in the now local variable. Then, schedule( ) computes the duration of the
CPU time slice used by prev:
now = sched_clock( );
run_time = now - prev->timestamp;
if (run_time > 1000000000)
    run_time = 1000000000;
The usual cut-off at 1 second (converted to nanoseconds)
applies. The run_time value is
used to charge the process for the CPU usage. However, a process
having a high average sleep time is favored:
run_time /= (CURRENT_BONUS(prev) ? : 1);
Remember that CURRENT_BONUS
returns a value between 0 and 10 that is proportional to the average
sleep time of the process.
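The effect of this division can be sketched in ordinary C. The code below is a simplified model, not the kernel's implementation: MAX_SLEEP_AVG and the linear scaling are illustrative stand-ins for the constants defined in kernel/sched.c.

```c
#include <assert.h>

/* Simplified model of the bonus computation: the bonus grows linearly
 * with the average sleep time, from 0 (never sleeps) up to MAX_BONUS.
 * MAX_SLEEP_AVG here is an illustrative constant, not the kernel's. */
#define MAX_BONUS     10
#define MAX_SLEEP_AVG 1000000000ULL  /* illustrative: 1 second in ns */

static unsigned int current_bonus(unsigned long long sleep_avg)
{
    if (sleep_avg > MAX_SLEEP_AVG)
        sleep_avg = MAX_SLEEP_AVG;
    return (unsigned int)(sleep_avg * MAX_BONUS / MAX_SLEEP_AVG);
}

/* The GCC "?:" idiom in run_time /= (CURRENT_BONUS(prev) ? : 1)
 * divides by the bonus, or by 1 when the bonus is zero, so a process
 * with a long average sleep time is charged only a fraction of the
 * CPU time it actually consumed. */
static unsigned long long charged_run_time(unsigned long long run_time,
                                           unsigned long long sleep_avg)
{
    unsigned int bonus = current_bonus(sleep_avg);
    return run_time / (bonus ? bonus : 1);
}
```

A CPU-bound process (bonus 0) is charged its full run time; a process that slept for the whole averaging window is charged only a tenth of it.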
Before starting to look at the runnable processes, schedule( ) must disable the local
interrupts and acquire the spin lock that protects the
runqueue:
spin_lock_irq(&rq->lock);
As explained in the section "Process Termination" in
Chapter 3, prev might be a process that is being
terminated. To recognize this case, schedule( ) looks at the PF_DEAD flag:
if (prev->flags & PF_DEAD)
    prev->state = EXIT_DEAD;
Next, schedule( ) examines
the state of prev. If it is not
runnable and it has not been preempted in Kernel Mode (see the
section "Returning from
Interrupts and Exceptions" in Chapter 4), then it should be
removed from the runqueue. However, if it has nonblocked pending
signals and its state is TASK_INTERRUPTIBLE, the function sets the
process state to TASK_RUNNING and
leaves it in the runqueue. This action is not the same as
assigning the processor to prev;
it just gives prev a chance to be
selected for execution:
if (prev->state != TASK_RUNNING &&
        !(preempt_count() & PREEMPT_ACTIVE)) {
    if (prev->state == TASK_INTERRUPTIBLE && signal_pending(prev))
        prev->state = TASK_RUNNING;
    else {
        if (prev->state == TASK_UNINTERRUPTIBLE)
            rq->nr_uninterruptible++;
        deactivate_task(prev, rq);
    }
}
The deactivate_task( )
function removes the process from the runqueue:
rq->nr_running--;
dequeue_task(p, p->array);
p->array = NULL;
Now, schedule( ) checks the
number of runnable processes left in the runqueue. If there are some
runnable processes, the function invokes the dependent_sleeper( ) function. In most
cases, this function immediately returns zero. If, however, the
kernel supports the hyper-threading technology (see the section
"Runqueue Balancing in
Multiprocessor Systems" later in this chapter), the function
checks whether the process that is going to be selected for
execution has significantly lower priority than a sibling process
already running on a logical CPU of the same physical CPU; in this
particular case, schedule( )
refuses to select the lower privilege process and executes the
swapper process instead.
if (rq->nr_running) {
if (dependent_sleeper(smp_processor_id( ), rq)) {
next = rq->idle;
goto switch_tasks;
}
}
If no runnable process exists, the function invokes idle_balance( ) to move some runnable
process from another runqueue to the local runqueue; idle_balance( ) is similar to load_balance( ), which is described in the
later section "The
load_balance( ) Function."
if (!rq->nr_running) {
idle_balance(smp_processor_id( ), rq);
if (!rq->nr_running) {
next = rq->idle;
rq->expired_timestamp = 0;
wake_sleeping_dependent(smp_processor_id( ), rq);
if (!rq->nr_running)
goto switch_tasks;
}
}
If idle_balance( ) fails to
move some process into the local runqueue, schedule( ) invokes wake_sleeping_dependent( ) to reschedule
runnable processes in idle CPUs (that is, in
every CPU that runs the swapper process). As
explained earlier when discussing the dependent_sleeper( ) function, this
unusual case might happen when the kernel supports the
hyper-threading technology. However, in uniprocessor systems, or
when all attempts to move a runnable process into the local runqueue
have failed, the function picks the swapper
process as next and continues
with the next phase.
Let's suppose that the schedule(
) function has determined that the runqueue includes some
runnable processes; now it has to check that at least one of these
runnable processes is active. If not, the function exchanges the
contents of the active and
expired fields of the runqueue
data structure; thus, all expired processes become active, while the
empty set is ready to receive the processes that will expire in the
future.
array = rq->active;
if (!array->nr_active) {
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = 140;
}
It is time to look up a runnable process in the active
prio_array_t data structure (see
the section "Identifying
a Process" in Chapter
3). First of all, schedule(
) searches for the first nonzero bit in the bitmask of the
active set. Remember that a bit in the bitmask is set when the
corresponding priority list is not empty. Thus, the index of the
first nonzero bit indicates the list containing the best process to
run. Then, the first process descriptor in that list is
retrieved:
idx = sched_find_first_bit(array->bitmap);
next = list_entry(array->queue[idx].next, task_t, run_list);
The sched_find_first_bit( )
function is based on the bsfl
assembly language instruction, which returns the bit
index of the least significant bit set to one in a 32-bit
word.
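A portable sketch of this lookup follows. The function name is hypothetical, and GCC's __builtin_ctz stands in for the bsfl instruction the kernel emits on x86; the 140 priority lists are mirrored by a bitmap of five 32-bit words.

```c
#include <assert.h>

/* Sketch of the priority-bitmap search: each of the 140 priority
 * lists has one bit in the bitmap, set when the list is not empty.
 * The index of the least significant set bit therefore identifies
 * the highest-priority nonempty list. */
#define MAX_PRIO 140
#define NR_WORDS ((MAX_PRIO + 31) / 32)

static int find_first_bit_sketch(const unsigned int bitmap[NR_WORDS])
{
    int w;
    for (w = 0; w < NR_WORDS; w++)
        if (bitmap[w])                      /* per-word scan: bsfl on x86 */
            return w * 32 + __builtin_ctz(bitmap[w]);
    return -1;  /* no bit set: every priority list is empty */
}
```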
The next local variable now
stores the descriptor pointer of the process that will replace
prev. The schedule( ) function looks at the next->activated field. This field
encodes the state of the process when it was awakened, as
illustrated in Table
7-6.
Table 7-6. The meaning of the activated field in the process descriptor
| Value | Description |
|---|---|
| 0 | The process was in the TASK_RUNNING state. |
| 1 | The process was in the TASK_INTERRUPTIBLE or TASK_STOPPED state, and it is being awakened by a system call service routine or a kernel thread. |
| 2 | The process was in the TASK_INTERRUPTIBLE or TASK_STOPPED state, and it is being awakened by an interrupt handler or a deferrable function. |
| −1 | The process was in the TASK_UNINTERRUPTIBLE state and it is being awakened. |
If next is a conventional
process and it is being awakened from the TASK_INTERRUPTIBLE or TASK_STOPPED state, the scheduler adds to
the average sleep time of the process the nanoseconds elapsed since
the process was inserted into the runqueue. In other words, the
sleep time of the process is increased to also cover the time spent
by the process in the runqueue waiting for the CPU:
if (next->prio >= 100 && next->activated > 0) {
unsigned long long delta = now - next->timestamp;
if (next->activated == 1)
delta = (delta * 38) / 128;
array = next->array;
dequeue_task(next, array);
recalc_task_prio(next, next->timestamp + delta);
enqueue_task(next, array);
}
next->activated = 0;
Observe that the scheduler makes a distinction between a process awakened by an interrupt handler or deferrable function, and a process awakened by a system call service routine or a kernel thread. In the former case, the scheduler adds the whole runqueue waiting time, while in the latter it adds just a fraction of that time. This is because interactive processes are more likely to be awakened by asynchronous events (think of the user pressing keys on the keyboard) rather than by synchronous ones.
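Following the code above, the wait-time credit given to next can be modeled as a small helper (the function name is hypothetical; the activated values are those of Table 7-6):

```c
#include <assert.h>

/* Sketch of the runqueue-wait credit applied to next: a process woken
 * by an interrupt handler or deferrable function (activated == 2) is
 * credited its whole wait, while one woken by a system call service
 * routine or a kernel thread (activated == 1) is credited only
 * 38/128 — about 30% — of it. */
static unsigned long long wait_credit(unsigned long long wait_ns,
                                      int activated)
{
    if (activated == 1)
        return wait_ns * 38 / 128;
    return wait_ns;
}
```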
Now the schedule( )
function has determined the next process to run. In a moment, the
kernel will access the thread_info data structure of next, whose address is stored close to the
top of next's process
descriptor:
switch_tasks:
    prefetch(next);
The prefetch macro is a
hint to the CPU control unit to bring the contents of the first
fields of next's process
descriptor in the hardware cache. It is here just to improve the
performance of schedule( ),
because the data are moved in parallel to the execution of the
following instructions, which do not affect next.
Before replacing prev, the
scheduler should do some administrative work:
clear_tsk_need_resched(prev);
rcu_qsctr_inc(prev->thread_info->cpu);
The clear_tsk_need_resched(
) function clears the TIF_NEED_RESCHED flag of prev, just in case schedule( ) has been invoked in the lazy
way. Then, the function records that the CPU is going through a
quiescent state (see the section "Read-Copy Update (RCU)"
in Chapter 5).
The schedule( ) function
must also decrease the average sleep time of prev, charging to it the slice of CPU time
used by the process:
prev->sleep_avg -= run_time;
if ((long)prev->sleep_avg <= 0)
prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;
The timestamps of the process are then updated.
It is quite possible that prev and next are the same process: this happens if
no other higher or equal priority active process is present in the
runqueue. In this case, the function skips the process
switch:
if (prev == next) {
spin_unlock_irq(&rq->lock);
goto finish_schedule;
}
At this point, prev and
next are different processes, and
the process switch is for real:
next->timestamp = now;
rq->nr_switches++;
rq->curr = next;
prev = context_switch(rq, prev, next);
The context_switch( )
function sets up the address space of next. As we'll see in "Memory Descriptor of Kernel
Threads" in Chapter
9, the active_mm field of
the process descriptor points to the memory descriptor that is used
by the process, while the mm
field points to the memory descriptor owned by the process. For
normal processes, the two fields hold the same address; however, a
kernel thread does not have its own address space and its mm field is always set to NULL. The context_switch( ) function ensures that if
next is a kernel thread, it uses
the address space used by prev:
if (!next->mm) {
next->active_mm = prev->active_mm;
atomic_inc(&prev->active_mm->mm_count);
enter_lazy_tlb(prev->active_mm, next);
}
Up to Linux version 2.2, kernel threads had their own address
space. That design choice was suboptimal, because the Page Tables
had to be changed whenever the scheduler selected a new process,
even if it was a kernel thread. Because kernel threads run in Kernel
Mode, they use only the fourth gigabyte of the linear address space,
whose mapping is the same for all processes in the system. Even
worse, writing into the cr3
register invalidates all TLB entries (see "Translation Lookaside Buffers
(TLB)" in Chapter
2), which leads to a significant performance penalty. Linux
is nowadays much more efficient because Page Tables aren't touched
at all if next is a kernel
thread. As further optimization, if next is a kernel thread, the schedule( ) function sets the process into
lazy TLB mode (see the section "Translation Lookaside Buffers
(TLB)" in Chapter
2).
Conversely, if next is a
regular process, the context_switch(
) function replaces the address space of prev with the one of next:
if (next->mm)
    switch_mm(prev->active_mm, next->mm, next);
If prev is a kernel thread
or an exiting process, the context_switch(
) function saves the pointer to the memory descriptor used
by prev in the runqueue's
prev_mm field, then resets
prev->active_mm:
if (!prev->mm) {
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
Now context_switch( ) can
finally invoke switch_to( ) to
perform the process switch between prev and next (see the section "Performing the Process
Switch" in Chapter
3):
switch_to(prev, next, prev);
return prev;
The instructions of the context_switch( ) and schedule( ) functions following the
switch_to macro invocation will
not be performed right away by the next process, but at a later time by
prev when the scheduler selects
it again for execution. However, at that moment, the prev local variable does not point to our
original process that was to be replaced when we started the
description of schedule( ), but
rather to the process that was replaced by our original prev when it was scheduled again. (If you
are confused, go back and read the section "Performing the Process
Switch" in Chapter
3.) The first instructions after a process switch are:
barrier( );
finish_task_switch(prev);
Right after the invocation of the context_switch( ) function in schedule( ), the barrier( ) macro yields an optimization
barrier for the code (see the section "Optimization and Memory
Barriers" in Chapter
5). Then, the finish_task_switch(
) function is executed:
mm = this_rq( )->prev_mm;
this_rq( )->prev_mm = NULL;
prev_task_flags = prev->flags;
spin_unlock_irq(&this_rq( )->lock);
if (mm)
mmdrop(mm);
if (prev_task_flags & PF_DEAD)
put_task_struct(prev);
If prev is a kernel thread,
the prev_mm field of the runqueue
stores the address of the memory descriptor that was lent to
prev. As we'll see in Chapter 9, mmdrop( ) decreases the usage counter of
the memory descriptor; if the counter reaches 0 (likely because
prev is a zombie process), the
function also frees the descriptor together with the associated Page
Tables and virtual memory regions.
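The counter protocol can be illustrated with a toy model. All names below are simplified stand-ins, not the kernel's types: borrow_mm plays the role of the atomic_inc in context_switch( ), and mmdrop_model that of mmdrop( ).

```c
#include <assert.h>

/* Toy model of the mm_count protocol around a process switch: a
 * kernel thread borrows prev->active_mm (incrementing the counter),
 * and the matching mmdrop( ) in finish_task_switch( ) releases the
 * descriptor only when the counter falls to zero. */
struct mm_model {
    int mm_count;  /* stand-in for the atomic usage counter */
    int freed;     /* 1 once the descriptor would be released */
};

static void borrow_mm(struct mm_model *mm)
{
    mm->mm_count++;  /* atomic_inc(&prev->active_mm->mm_count) */
}

static void mmdrop_model(struct mm_model *mm)
{
    if (--mm->mm_count == 0)
        mm->freed = 1;  /* free Page Tables and memory regions */
}
```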
The finish_task_switch( )
function also releases the spin lock of the runqueue and enables the
local interrupts. Then, it checks whether prev is a zombie task that is being
removed from the system (see the section "Process Termination" in
Chapter 3); if so, it
invokes put_task_struct( ) to
free the process descriptor reference counter and drop all remaining
references to the process (see the section "Process Removal" in
Chapter 3).
The very last instructions of the schedule( ) function are:
finish_schedule:
prev = current;
if (prev->lock_depth >= 0)
__reacquire_kernel_lock( );
preempt_enable_no_resched();
if (test_bit(TIF_NEED_RESCHED, &current_thread_info( )->flags))
goto need_resched;
return;
As you see, schedule( )
reacquires the big kernel lock if necessary, reenables kernel preemption, and checks
whether some other process has set the TIF_NEED_RESCHED flag of the current
process. In this case, the whole schedule(
) function is reexecuted from the beginning; otherwise,
the function terminates.
We have seen in Chapter 4 that Linux sticks to the Symmetric Multiprocessing model (SMP ); this means, essentially, that the kernel should not have any bias toward one CPU with respect to the others. However, multiprocessor machines come in many different flavors, and the scheduler behaves differently depending on the hardware characteristics. In particular, we will consider the following three types of multiprocessor machines:
Until recently, this was the most common architecture for multiprocessor machines. These machines have a common set of RAM chips shared by all CPUs.
A hyper-threaded chip is a microprocessor that executes several threads of execution at once; it includes several copies of the internal registers and quickly switches between them. This technology, which was invented by Intel, allows the processor to exploit the machine cycles to execute another thread while the current thread is stalled for a memory access. A hyper-threaded physical CPU is seen by Linux as several different logical CPUs.
CPUs and RAM chips are grouped in local "nodes" (usually a node includes one CPU and a few RAM chips). The memory arbiter (a special circuit that serializes the accesses to RAM performed by the CPUs in the system, see the section "Memory Addresses" in Chapter 2) is a bottleneck for the performance of the classic multiprocessor systems. In a NUMA architecture, when a CPU accesses a "local" RAM chip inside its own node, there is little or no contention, thus the access is usually fast; on the other hand, accessing a "remote" RAM chip outside of its node is much slower. We'll mention in the section "Non-Uniform Memory Access (NUMA)" in Chapter 8 how the Linux kernel memory allocator supports NUMA architectures.
These basic kinds of multiprocessor systems are often combined. For instance, a motherboard that includes two different hyper-threaded CPUs is seen by the kernel as four logical CPUs.
As we have seen in the previous section, the schedule( ) function picks the new process to
run from the runqueue of the local CPU. Therefore, a given CPU can
execute only the runnable processes that are contained in the
corresponding runqueue. On the other hand, a runnable process is always
stored in exactly one runqueue: no runnable process ever appears in two
or more runqueues. Therefore, as long as a process remains runnable, it is
usually bound to one CPU.
This design choice is usually beneficial for system performance, because the hardware cache of every CPU is likely to be filled with data owned by the runnable processes in the runqueue. In some cases, however, binding a runnable process to a given CPU might induce a severe performance penalty. For instance, consider a large number of batch processes that make heavy use of the CPU: if most of them end up in the same runqueue, one CPU in the system will be overloaded, while the others will be nearly idle.
Therefore, the kernel periodically checks whether the workloads of the runqueues are balanced and, if necessary, moves some process from one runqueue to another. However, to get the best performance from a multiprocessor system, the load balancing algorithm should take into consideration the topology of the CPUs in the system. Starting from kernel version 2.6.7, Linux sports a sophisticated runqueue balancing algorithm based on the notion of "scheduling domains." Thanks to the scheduling domains, the algorithm can be easily tuned for all kinds of existing multiprocessor architectures (and even for recent architectures such as those based on the "multi-core" microprocessors).
Essentially, a scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel. Generally speaking, scheduling domains are hierarchically organized: the top-most scheduling domain, which usually spans all CPUs in the system, includes children scheduling domains, each of which includes a subset of the CPUs. Thanks to the hierarchy of scheduling domains, workload balancing can be done in a rather efficient way.
Every scheduling domain is partitioned, in turn, in one or more groups, each of which represents a subset of the CPUs of the scheduling domain. Workload balancing is always done between groups of a scheduling domain. In other words, a process is moved from one CPU to another only if the total workload of some group in some scheduling domain is significantly lower than the workload of another group in the same scheduling domain.
Figure 7-2 illustrates three examples of scheduling domain hierarchies, corresponding to the three main architectures of multiprocessor machines.
Figure 7-2 (a) represents a hierarchy composed of a single scheduling domain for a 2-CPU classic multiprocessor architecture. The scheduling domain includes only two groups, each of which includes one CPU.
Figure 7-2 (b) represents a two-level hierarchy for a 2-CPU multiprocessor box with hyper-threading technology. The top-level scheduling domain spans all four logical CPUs in the system, and it is composed of two groups. Each group of the top-level domain corresponds to a child scheduling domain and spans a physical CPU. The bottom-level scheduling domains (also called base scheduling domains ) include two groups, one for each logical CPU.
Finally, Figure 7-2 (c) represents a two-level hierarchy for an 8-CPU NUMA architecture with two nodes and four CPUs per node. The top-level domain is organized in two groups, each of which corresponds to a different node. Every base scheduling domain spans the CPUs inside a single node and has four groups, each of which spans a single CPU.
Every scheduling domain is represented by a sched_domain descriptor, while every group
inside a scheduling domain is represented by a sched_group descriptor. Each sched_domain descriptor includes a field
groups, which points to the first
element in a list of group descriptors. Moreover, the parent field of the sched_domain structure points to the
descriptor of the parent scheduling domain, if any.
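A heavily abridged sketch of the two descriptors and of how the parent links form the hierarchy follows. Field names track the text, but the struct names carry a _sketch suffix to flag that these are stand-ins: the real 2.6 structures carry many more members, mostly balancing parameters.

```c
#include <assert.h>
#include <stddef.h>

struct sched_group_sketch {
    struct sched_group_sketch *next;  /* next group in the domain's list */
    unsigned long cpumask;            /* CPUs spanned by this group */
};

struct sched_domain_sketch {
    struct sched_domain_sketch *parent;  /* NULL at the top level */
    struct sched_group_sketch *groups;   /* first group descriptor */
    unsigned long span;                  /* CPUs spanned by the domain */
};

/* Walking the parent links from a base domain counts the levels of
 * the hierarchy, mirroring the loop performed by rebalance_tick( ). */
static int domain_levels(const struct sched_domain_sketch *sd)
{
    int n = 0;
    for (; sd != NULL; sd = sd->parent)
        n++;
    return n;
}
```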
The sched_domain descriptors
of all physical CPUs in the system are stored in the per-CPU variable
phys_domains. If the kernel does
not support the hyper-threading technology, these domains are at the
bottom level of the domain hierarchy, and the sd fields of the runqueue descriptors point
to them—that is, they are the base scheduling domains. Conversely, if
the kernel supports the hyper-threading technology, the bottom-level
scheduling domains are stored in the per-CPU variable cpu_domains.
To keep the runqueues in the system balanced, the
rebalance_tick( ) function is
invoked by scheduler_tick( ) once
every tick. It receives as its parameters the index this_cpu of the local CPU, the address
this_rq of the local runqueue, and
a flag, idle, which can assume the
following values:
SCHED_IDLE
The CPU is currently idle, that is, current is the
swapper process.
NOT_IDLE
The CPU is not currently idle, that is, current is not the
swapper process.
The rebalance_tick( )
function determines first the number of processes in the runqueue and
updates the runqueue's average workload; to do this, the function
accesses the nr_running and
cpu_load fields of the runqueue
descriptor.
Then, rebalance_tick( )
starts a loop over all scheduling domains in the path from the base
domain (referenced by the sd field
of the local runqueue descriptor) to the top-level domain. In each
iteration the function determines whether the time has come to invoke
the load_balance( ) function, thus
executing a rebalancing operation on the scheduling domain. The value
of idle and some parameters stored
in the sched_domain descriptor
determine the frequency of the invocations of load_balance( ). If idle is equal to SCHED_IDLE, then the runqueue is empty, and
rebalance_tick( ) invokes load_balance( ) quite often (roughly once
every one or two ticks for scheduling domains corresponding to logical
and physical CPUs). Conversely, if idle is equal to NOT_IDLE, rebalance_tick( ) invokes load_balance( ) sparingly (roughly once
every 10 milliseconds for scheduling domains corresponding to logical
CPUs, and once every 100 milliseconds for scheduling domains
corresponding to physical CPUs).
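A toy version of this invocation policy can make the two regimes concrete. The numbers below merely mirror the orders of magnitude quoted above; in the kernel the actual intervals are per-sched_domain tuning parameters, and the names here are hypothetical.

```c
#include <assert.h>

enum { SKETCH_NOT_IDLE = 0, SKETCH_SCHED_IDLE = 1 };

/* An idle CPU polls its domains aggressively (every tick or two);
 * a busy CPU rebalances sparingly (every 10 ms for domains of
 * logical CPUs, every 100 ms for domains of physical CPUs). */
static int balance_interval_ms(int idle, int physical_cpu_level)
{
    if (idle == SKETCH_SCHED_IDLE)
        return physical_cpu_level ? 2 : 1;
    return physical_cpu_level ? 100 : 10;
}
```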
The load_balance( )
function checks whether a scheduling domain is significantly
unbalanced; more precisely, it checks whether unbalancing can be
reduced by moving some processes from the busiest group to the
runqueue of the local CPU. If so, the function attempts this
migration. It receives four parameters:
this_cpu
The index of the local CPU
this_rq
The address of the descriptor of the local runqueue
sd
Points to the descriptor of the scheduling domain to be checked
idle
Either SCHED_IDLE
(local CPU is idle) or NOT_IDLE
The function performs the following operations:
Acquires the this_rq->lock spin lock.
Invokes the find_busiest_group(
) function to analyze the workloads of the groups inside
the scheduling domain. The function returns the address of the
sched_group descriptor of the
busiest group, provided that this group does not include the local
CPU; in this case, the function also returns the number of
processes to be moved into the local runqueue to restore
balancing. On the other hand, if either the busiest group includes
the local CPU or all groups are essentially balanced, the function
returns NULL. This procedure is
not trivial, because the function tries to filter the statistical
fluctuations in the workloads.
If find_busiest_group( )
did not find a group not including the local CPU that is
significantly busier than the other groups in the scheduling
domain, the function releases the this_rq->lock spin lock, tunes the
parameters in the scheduling domain descriptor so as to delay the
next invocation of load_balance(
) on the local CPU, and terminates.
Invokes the find_busiest_queue( ) function to find the busiest CPU in the group found in step 2. The function returns the address busiest of the descriptor of the corresponding runqueue.
Acquires a second spin lock, namely the busiest->lock spin lock. To prevent deadlocks, this has to be done carefully: this_rq->lock is first released, then the two locks are acquired in order of increasing CPU index.
Invokes the move_tasks( )
function to try moving some processes from the busiest runqueue to the local runqueue
this_rq (see the next
section).
If the move_tasks( ) function failed to migrate some process to the local runqueue, the scheduling domain is still unbalanced. The function sets the busiest->active_balance flag to 1 and wakes up the migration kernel thread whose descriptor is stored in busiest->migration_thread. The migration kernel thread walks the chain of scheduling domains, from the base domain of the busiest runqueue up to the top domain, looking for an idle CPU. If an idle CPU is found, the kernel thread invokes move_tasks( ) to move one process into the idle runqueue.
Releases the busiest->lock and this_rq->lock spin locks.
Terminates.
The move_tasks( )
function moves processes from a source runqueue to the local runqueue.
It receives six parameters: this_rq
and this_cpu (the local runqueue
descriptor and the local CPU index), busiest (the source runqueue descriptor),
max_nr_move (the maximum number of
processes to be moved), sd (the
address of the scheduling domain descriptor in which this balancing operation is carried out), and the idle flag (besides SCHED_IDLE and NOT_IDLE, this flag can also be set to
NEWLY_IDLE when the function is
indirectly invoked by idle_balance(
); see the section "The schedule( ) Function"
earlier in this chapter).
The function first analyzes the expired processes of the busiest runqueue, starting from the highest-priority ones. When all expired processes have been scanned,
the function scans the active processes of the busiest runqueue. For each candidate
process, the function invokes can_migrate_task( ), which returns 1 if all
the following conditions hold:
The process is not being currently executed by the remote CPU.
The local CPU is included in the cpus_allowed bitmask of the process
descriptor.
At least one of the following holds:
The local CPU is idle. If the kernel supports the hyper-threading technology, all logical CPUs in the local physical chip must be idle.
The kernel is having trouble balancing the scheduling domain, because repeated attempts to move processes have failed.
The process to be moved is not "cache hot" (it has not recently executed on the remote CPU, so one can assume that no data of the process is included in the hardware cache of the remote CPU).
If can_migrate_task( )
returns the value 1, move_tasks( )
invokes the pull_task( ) function
to move the candidate process to the local runqueue. Essentially,
pull_task( ) executes dequeue_task( ) to remove the process from
the remote runqueue, then executes enqueue_task( ) to insert the process in the
local runqueue, and finally, if the process just moved has higher
dynamic priority than current,
invokes resched_task( ) to preempt
the current process of the local CPU.
Several system calls have been introduced to allow processes to change their priorities and scheduling policies. As a general rule, users are always allowed to lower the priorities of their processes. However, if they want to modify the priorities of processes belonging to some other user or if they want to increase the priorities of their own processes, they must have superuser privileges.
The nice( ) [*] system call allows processes to change their base
priority. The integer value contained in the increment parameter is used to modify the
nice field of the process
descriptor. The nice Unix command,
which allows users to run programs with modified scheduling priority,
is based on this system call.
The sys_nice( ) service
routine handles the nice( ) system
call. Although the increment
parameter may have any value, absolute values larger than 40 are
trimmed down to 40. Traditionally, negative values correspond to
requests for priority increments and require superuser privileges,
while positive ones correspond to requests for priority decreases. In
the case of a negative increment, the function invokes the capable( ) function to verify whether the
process has a CAP_SYS_NICE
capability. Moreover, the function invokes the security_task_setnice( ) security hook. We
discuss that function in Chapter
20. If the user turns out to have the privilege required to
change priorities, sys_nice( )
converts current->static_prio to
the range of nice values, adds the value of increment, and invokes the set_user_nice( ) function. In turn, the
latter function gets the local runqueue lock, updates the static
priority of current, invokes the
resched_task( ) function to allow
other processes to preempt current,
and releases the runqueue lock.
The nice( ) system call is
maintained for backward compatibility only; it has been replaced by
the setpriority( ) system call described next.
The nice( ) system call
affects only the process that invokes it. Two other system calls,
denoted as getpriority( ) and
setpriority( ), act on the base
priorities of all processes in a given group. getpriority( ) returns 20 minus the lowest
nice field value among all
processes in a given group—that is, the highest priority among those
processes; setpriority( ) sets the
base priority of all processes in a given group to a given
value.
The kernel implements these system calls by means of the
sys_getpriority( ) and sys_setpriority( ) service routines. Both of
them act essentially on the same group of parameters:
which
The value that identifies the group of processes; it can assume one of the following:
PRIO_PROCESS
Selects the processes according to their process ID (pid field of the process descriptor).
PRIO_PGRP
Selects the processes according to their group ID (pgrp field of the process descriptor).
PRIO_USER
Selects the processes according to their user ID (uid field of the process descriptor).
who
The value of the pid,
pgrp, or uid field (depending on the value of
which) to be used for
selecting the processes. If who is 0, its value is set to that of
the corresponding field of the current process.
niceval
The new base priority value (needed only by sys_setpriority( )). It should range between -20 (highest priority) and +19 (lowest priority).
As stated before, only processes with a CAP_SYS_NICE capability are allowed to
increase their own base priority or to modify that of other
processes.
As we will see in Chapter 10, system calls return a negative value only if some error occurred. For this reason, getpriority( ) does not return a normal nice value ranging between -20 and +19, but rather a nonnegative value ranging between 1 and 40.
The sched_getaffinity(
) and sched_setaffinity(
) system calls respectively return and set up the CPU
affinity mask of a process—the bit mask of the CPUs that are allowed
to execute the process. This mask is stored in the cpus_allowed field of the process
descriptor.
The sys_sched_getaffinity( )
system call service routine looks up the process descriptor by
invoking find_task_by_pid( ), and
then returns the value of the corresponding cpus_allowed field ANDed with the bitmap of
the available CPUs.
The sys_sched_setaffinity( )
system call is a bit more complicated. Besides looking for the
descriptor of the target process and updating the cpus_allowed field, this function has to
check whether the process is included in a runqueue of a CPU that is
no longer present in the new affinity mask. In the worst case, the
process has to be moved from one runqueue to another one. To avoid
problems due to deadlocks and race conditions, this job is done by the
migration kernel threads (there is one thread per CPU). Whenever
a process has to be moved from a runqueue rq1 to another runqueue rq2, the system call awakens the migration
thread of rq1 (rq1->migration_thread), which in turn
removes the process from rq1 and
inserts it into rq2.
We now introduce a group of system calls that allow
processes to change their scheduling discipline and, in particular, to
become real-time processes. As usual, a process must have a CAP_SYS_NICE capability to modify the values
of the rt_priority and policy process descriptor fields of any
process, including itself.
The sched_getscheduler( )
system call queries the scheduling policy currently applied to the
process identified by the pid
parameter. If pid equals 0, the
policy of the calling process is retrieved. On success, the system
call returns the policy for the process: SCHED_FIFO, SCHED_RR, or SCHED_NORMAL (the latter is also called
SCHED_OTHER). The corresponding
sys_sched_getscheduler( ) service
routine invokes find_process_by_pid(
), which locates the process descriptor corresponding to
the given pid and returns the
value of its policy field.
The sched_setscheduler( )
system call sets both the scheduling policy and the associated
parameters for the process identified by the parameter pid. If pid is equal to 0, the scheduler
parameters of the calling process will be set.
The corresponding sys_sched_setscheduler( ) system call
service routine simply invokes do_sched_setscheduler( ). The latter
function checks whether the scheduling policy specified by the
policy parameter and the new
priority specified by the param->sched_priority parameter are
valid. It also checks whether the process has CAP_SYS_NICE capability or whether its
owner has superuser rights. If everything is OK, it removes the
process from its runqueue (if it is runnable); updates the static,
real-time, and dynamic priorities of the process; inserts the
process back in the runqueue; and finally invokes, if necessary, the
resched_task( ) function to
preempt the current process of the runqueue.
The sched_getparam( )
system call retrieves the scheduling parameters for the process
identified by pid. If pid is 0, the parameters of the current process are retrieved. The
corresponding sys_sched_getparam(
) service routine, as one would expect, finds the process
descriptor pointer associated with pid, stores its rt_priority field in a local variable of
type sched_param, and invokes
copy_to_user( ) to copy it into
the process address space at the address specified by the param parameter.
The sched_setparam( )
system call is similar to sched_setscheduler( ). The difference is
that sched_setparam( ) does not
let the caller set the policy
field's value.[*] The corresponding sys_sched_setparam( ) service routine
invokes do_sched_setscheduler( ),
with almost the same parameters as sys_sched_setscheduler( ).
The sched_yield( )
system call allows a process to relinquish the CPU voluntarily
without being suspended; the process remains in a TASK_RUNNING state, but the scheduler puts
it either in the expired set of the runqueue (if the process is a
conventional one), or at the end of the runqueue list (if the
process is a real-time one). The schedule(
) function is then invoked. In this way, other processes
that have the same dynamic priority have a chance to run. The call
is used mainly by SCHED_FIFO
real-time processes.
The sched_get_priority_min(
) and sched_get_priority_max(
) system calls return, respectively, the minimum and the
maximum real-time static priority value that can be used with the
scheduling policy identified by the policy parameter.
The sys_sched_get_priority_min(
) service routine returns 1 if current is a real-time process, 0
otherwise.
The sys_sched_get_priority_max(
) service routine returns 99 (the highest priority) if
current is a real-time process, 0
otherwise.
The sched_rr_get_interval(
) system call writes into a structure stored in the User
Mode address space the Round Robin time quantum for the real-time
process identified by the pid
parameter. If pid is zero, the
system call writes the time quantum of the current process.
The corresponding sys_sched_rr_get_interval( ) service
routine invokes, as usual, find_process_by_pid( ) to retrieve the
process descriptor associated with pid. It then converts the base time
quantum of the selected process into seconds and nanoseconds and
copies the numbers into the User Mode structure. Conventionally, the
time quantum of a FIFO real-time process is equal to zero.
We saw in Chapter 2 how Linux takes advantage of 80 × 86's segmentation and paging circuits to translate logical addresses into physical ones. We also mentioned that some portion of RAM is permanently assigned to the kernel and used to store both the kernel code and the static kernel data structures.
The remaining part of the RAM is called dynamic memory . It is a valuable resource, needed not only by the processes but also by the kernel itself. In fact, the performance of the entire system depends on how efficiently dynamic memory is managed. Therefore, all current multitasking operating systems try to optimize the use of dynamic memory, assigning it only when it is needed and freeing it as soon as possible. Figure 8-1 shows schematically the page frames used as dynamic memory; see the section "Physical Memory Layout" in Chapter 2 for details.
This chapter, which consists of three main sections, describes how the kernel allocates dynamic memory for its own use. The sections "Page Frame Management" and "Memory Area Management" illustrate two different techniques for handling physically contiguous memory areas, while the section "Noncontiguous Memory Area Management" illustrates a third technique that handles noncontiguous memory areas. In these sections we'll cover topics such as memory zones, kernel mappings, the buddy system, the slab cache, and memory pools.
We saw in the section "Paging in Hardware" in Chapter 2 how the Intel Pentium processor can use two different page frame sizes: 4 KB and 4 MB (or 2 MB if PAE is enabled—see the section "The Physical Address Extension (PAE) Paging Mechanism" in Chapter 2). Linux adopts the smaller 4 KB page frame size as the standard memory allocation unit. This makes things simpler for two reasons:
The Page Fault exceptions issued by the paging circuitry are easily interpreted. Either the page requested exists but the process is not allowed to address it, or the page does not exist. In the second case, the memory allocator must find a free 4 KB page frame and assign it to the process.
Although both 4 KB and 4 MB are multiples of all disk block sizes, transfers of data between main memory and disks are in most cases more efficient when the smaller size is used.
The kernel must keep track of the current status of each page frame. For instance, it must be able to distinguish the page frames that are used to contain pages that belong to processes from those that contain kernel code or kernel data structures. Similarly, it must be able to determine whether a page frame in dynamic memory is free. A page frame in dynamic memory is free if it does not contain any useful data. It is not free when the page frame contains data of a User Mode process, data of a software cache, dynamically allocated kernel data structures, buffered data of a device driver, code of a kernel module, and so on.
State information of a page frame is kept in a page descriptor
of type page, whose fields are
shown in Table 8-1.
All page descriptors are stored in the mem_map array. Because each descriptor is 32
bytes long, the space required by mem_map is slightly less than 1% of the
whole RAM. The virt_to_page(addr)
macro yields the address of the page descriptor associated with the
linear address addr. The pfn_to_page(pfn) macro yields the address of
the page descriptor associated with the page frame having number
pfn.
Table 8-1. The fields of the page descriptor
Type | Name | Description |
|---|---|---|
| | Array of flags (see Table 8-2). Also encodes the zone number to which the page frame belongs. |
| | Page frame's reference counter. |
| | Number of Page Table entries that refer to the page frame. |
| | Available to the kernel component that is using the page (for instance, it is a buffer head pointer in case of a buffer page; see "Block Buffers and Buffer Heads" in Chapter 15). If the page is free, this field is used by the buddy system (see later in this chapter). |
| | Used when the page is inserted into the page cache (see the section "The Page Cache" in Chapter 15), or when it belongs to an anonymous region (see the section "Reverse Mapping for Anonymous Pages" in Chapter 17). |
| | Used by several kernel components with different meanings. For instance, it identifies the position of the data stored in the page frame within the page's disk image or within an anonymous region (Chapter 15), or it stores a swapped-out page identifier (Chapter 17). |
| | Contains pointers to the least recently used doubly linked list of pages. |
You don't have to fully understand the role of all fields in the page descriptor right now. In the following chapters, we often come back to the fields of the page descriptor. Moreover, several fields have different meanings, depending on whether the page frame is free and on which kernel component is using it.
Let's describe in greater detail two of the fields:
_count
A usage reference counter for the page. If it is set to
-1, the corresponding page
frame is free and can be assigned to any process or to the
kernel itself. If it is set to a value greater than or equal to
0, the page frame is assigned to one or more processes or is
used to store some kernel data structures. The page_count( ) function returns the
value of the _count field
increased by one, that is, the number of users of the
page.
flags
Includes up to 32 flags (see Table 8-2) that
describe the status of the page frame. For each PG_ xyz flag, the
kernel defines some macros that manipulate its value. Usually,
the PageXyz macro returns the value of the flag, while the SetPageXyz and ClearPageXyz macros set and clear the corresponding bit, respectively.
Table 8-2. Flags describing the status of a page frame
Flag name | Meaning |
|---|---|
| The page is locked; for instance, it is involved in a disk I/O operation. |
| An I/O error occurred while transferring the page. |
| The page has been recently accessed. |
| This flag is set after completing a read operation, unless a disk I/O error happened. |
| The page has been modified (see the section "Implementing the PFRA" in Chapter 17). |
| The page is in the active or inactive page list (see the section "The Least Recently Used (LRU) Lists" in Chapter 17). |
| The page is in the active page list (see the section "The Least Recently Used (LRU) Lists" in Chapter 17). |
| The page frame is included in a slab (see the section "Memory Area Management" later in this chapter). |
| The page frame belongs to the ZONE_HIGHMEM zone (see the section "Non-Uniform Memory Access (NUMA)" later in this chapter). |
| Used by some filesystems such as Ext2 and Ext3 (see Chapter 18). |
| Not used on the 80 × 86 architecture. |
| The page frame is reserved for kernel code or is unusable. |
| The |
| The page is being written to disk by means of the |
| Used for system suspend/resume. |
| The page frame is handled through the extended paging mechanism (see the section "Extended Paging" in Chapter 2). |
| The page belongs to the swap cache (see the section "The Swap Cache" in Chapter 17). |
| All data in the page frame corresponds to blocks allocated on disk. |
| The page has been marked to be written to disk in order to reclaim memory. |
| Used for system suspend/resume. |
We are used to thinking of the computer's memory as a homogeneous, shared resource. Disregarding the role of the hardware caches, we expect the time required for a CPU to access a memory location to be essentially the same, regardless of the location's physical address and the CPU. Unfortunately, this assumption is not true in some architectures. For instance, it is not true for some multiprocessor Alpha or MIPS computers.
Linux 2.6 supports the Non-Uniform Memory Access (NUMA) model, in which the access times for different memory locations from a given CPU may vary. The physical memory of the system is partitioned in several nodes . The time needed by a given CPU to access pages within a single node is the same. However, this time might not be the same for two different CPUs. For every CPU, the kernel tries to minimize the number of accesses to costly nodes by carefully selecting where the kernel data structures that are most often referenced by the CPU are stored.[*]
The physical memory inside each node can be split into several
zones, as we will see in the next section. Each node has a descriptor
of type pg_data_t, whose fields are
shown in Table 8-3.
All node descriptors are stored in a singly linked list, whose first
element is pointed to by the pgdat_list variable.
Table 8-3. The fields of the node descriptor
Type | Name | Description |
|---|---|---|
| | Array of zone descriptors of the node |
| | Array of |
| | Number of zones in the node |
| | Array of page descriptors of the node |
| | Used in the kernel initialization phase |
| | Index of the first page frame in the node |
| | Size of the memory node, excluding holes (in page frames) |
| | Size of the node, including holes (in page frames) |
| | Identifier of the node |
| | Next item in the memory node list |
wait_queue_head_t | kswapd_wait | Wait queue for the kswapd pageout daemon (see the section "Periodic Reclaiming" in Chapter 17) |
struct task_struct * | kswapd | Pointer to the process descriptor of the kswapd kernel thread |
int | kswapd_max_order | Logarithmic size of free blocks to be created by kswapd |
As usual, we are mostly concerned with the 80 × 86 architecture.
IBM-compatible PCs use the Uniform Memory Access model (UMA), thus the
NUMA support is not really required. However, even if NUMA support is
not compiled in the kernel, Linux makes use of a single node that
includes all system physical memory. Thus, the pgdat_list variable points to a list
consisting of a single element—the node 0 descriptor—stored in the
contig_page_data variable.
On the 80 × 86 architecture, grouping the physical memory in a single node might appear useless; however, this approach makes the memory handling code more portable, because the kernel can assume that the physical memory is partitioned in one or more nodes in all architectures.[*]
In an ideal computer architecture, a page frame is a memory storage unit that can be used for anything: storing kernel and user data, buffering disk data, and so on. Every kind of page of data can be stored in a page frame, without limitations.
However, real computer architectures have hardware constraints that may limit the way page frames can be used. In particular, the Linux kernel must deal with two hardware constraints of the 80 × 86 architecture:
The Direct Memory Access (DMA) processors for old ISA buses have a strong limitation: they are able to address only the first 16 MB of RAM.
In modern 32-bit computers with lots of RAM, the CPU cannot directly access all physical memory because the linear address space is too small.
To cope with these two limitations, Linux 2.6 partitions the physical memory of every memory node into three zones. In the 80 × 86 UMA architecture the zones are:
ZONE_DMA
Contains page frames of memory below 16 MB
ZONE_NORMAL
Contains page frames of memory at and above 16 MB and below 896 MB
ZONE_HIGHMEM
Contains page frames of memory at and above 896 MB
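The zone boundaries above can be captured in a short classification function. This is an illustrative sketch, not kernel code; the enum names are hypothetical stand-ins for the real zone constants:

```c
/* Classify a physical address into the three 80x86 zones using the
 * 16 MB and 896 MB boundaries given in the text. */
enum zone_kind { ZONE_DMA_K, ZONE_NORMAL_K, ZONE_HIGHMEM_K };

static enum zone_kind zone_of_paddr(unsigned long long paddr)
{
    if (paddr < 16ULL << 20)      /* below 16 MB */
        return ZONE_DMA_K;
    if (paddr < 896ULL << 20)     /* at and above 16 MB, below 896 MB */
        return ZONE_NORMAL_K;
    return ZONE_HIGHMEM_K;        /* at and above 896 MB */
}
```

For instance, physical address `0x100000` (1 MB) classifies as DMA-capable memory, while anything from `0x38000000` (896 MB) up falls in high memory.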
The ZONE_DMA zone includes
page frames that can be used by old ISA-based devices by means of the
DMA. (The section "Direct
Memory Access (DMA)" in Chapter 13 gives further details
on DMA.)
The ZONE_DMA and ZONE_NORMAL zones include the "normal" page
frames that can be directly accessed by the kernel through the linear
mapping in the fourth gigabyte of the linear address space (see the
section "Kernel Page
Tables" in Chapter
2). Conversely, the ZONE_HIGHMEM zone includes page frames that
cannot be directly accessed by the kernel through the linear mapping
in the fourth gigabyte of linear address space (see the section "Kernel Mappings of High-Memory
Page Frames" later in this chapter). The ZONE_HIGHMEM zone is always empty on 64-bit
architectures.
Each memory zone has its own descriptor of type zone. Its fields are shown in Table 8-4.
Table 8-4. The fields of the zone descriptor
| Type | Name | Description |
|---|---|---|
| unsigned long | free_pages | Number of free pages in the zone. |
| unsigned long | pages_min | Number of reserved pages of the zone (see the section "The Pool of Reserved Page Frames" later in this chapter). |
| unsigned long | pages_low | Low watermark for page frame reclaiming; also used by the zone allocator as a threshold value (see the section "The Zone Allocator" later in this chapter). |
| unsigned long | pages_high | High watermark for page frame reclaiming; also used by the zone allocator as a threshold value. |
| unsigned long [] | lowmem_reserve | Specifies how many page frames in each zone must be reserved for handling low-on-memory critical situations. |
| struct per_cpu_pageset [] | pageset | Data structure used to implement special caches of single page frames (see the section "The Per-CPU Page Frame Cache" later in this chapter). |
| spinlock_t | lock | Spin lock protecting the descriptor. |
| struct free_area [] | free_area | Identifies the blocks of free page frames in the zone (see the section "The Buddy System Algorithm" later in this chapter). |
| spinlock_t | lru_lock | Spin lock for the active and inactive lists. |
| struct list_head | active_list | List of active pages in the zone (see Chapter 17). |
| struct list_head | inactive_list | List of inactive pages in the zone (see Chapter 17). |
| unsigned long | nr_scan_active | Number of active pages to be scanned when reclaiming memory (see the section "Low On Memory Reclaiming" in Chapter 17). |
| unsigned long | nr_scan_inactive | Number of inactive pages to be scanned when reclaiming memory. |
| unsigned long | nr_active | Number of pages in the zone's active list. |
| unsigned long | nr_inactive | Number of pages in the zone's inactive list. |
| unsigned long | pages_scanned | Counter used when doing page frame reclaiming in the zone. |
| int | all_unreclaimable | Flag set when the zone is full of unreclaimable pages. |
| int | temp_priority | Temporary zone's priority (used when doing page frame reclaiming). |
| int | prev_priority | Zone's priority ranging between 12 and 0 (used by the page frame reclaiming algorithm; see the section "Low On Memory Reclaiming" in Chapter 17). |
| wait_queue_head_t * | wait_table | Hash table of wait queues of processes waiting for one of the pages of the zone. |
| unsigned long | wait_table_size | Size of the wait queue hash table. |
| unsigned long | wait_table_bits | Power-of-2 order of the size of the wait queue hash table array. |
| struct pglist_data * | zone_pgdat | Memory node (see the earlier section "Non-Uniform Memory Access (NUMA)"). |
| struct page * | zone_mem_map | Pointer to first page descriptor of the zone. |
| unsigned long | zone_start_pfn | Index of the first page frame of the zone. |
| unsigned long | spanned_pages | Total size of zone in pages, including holes. |
| unsigned long | present_pages | Total size of zone in pages, excluding holes. |
| char * | name | Pointer to the conventional name of the zone: "DMA," "Normal," or "HighMem." |
Many fields of the zone
structure are used for page frame reclaiming and will be described in
Chapter 17.
Each page descriptor has links to the memory node and to the
zone inside the node that includes the corresponding page frame. To
save space, these links are not stored as classical pointers; rather,
they are encoded as indices stored in the high bits of the flags field. In fact, the number of flags
that characterize a page frame is limited, thus it is always possible
to reserve the most significant bits of the flags field to encode the proper memory node
and zone number.[*] The page_zone( )
function receives as its parameter the address of a page descriptor;
it reads the most significant bits of the flags field in the page descriptor, then it
determines the address of the corresponding zone descriptor by looking
in the zone_table array. This array
is initialized at boot time with the addresses of all zone descriptors
of all memory nodes.
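The encoding described above can be sketched as follows. The bit position and table size are illustrative assumptions, not the kernel's actual values; the mechanism — pack a small index into the top bits of the flags word and use it to index a lookup table of zone descriptors, as `page_zone( )` does with `zone_table` — is what the text describes:

```c
/* Assume the top 2 bits of a 64-bit flags word hold the zone index. */
#define ZONE_SHIFT 62

struct zone { const char *name; };

/* Stand-in for zone_table, filled at boot with all zone descriptors. */
static struct zone zone_table[3] = {
    { "DMA" }, { "Normal" }, { "HighMem" }
};

/* Store the zone index in the most significant bits of flags. */
static unsigned long long encode_flags(unsigned long long flags, int zone_idx)
{
    return flags | ((unsigned long long)zone_idx << ZONE_SHIFT);
}

/* Analogous to page_zone( ): recover the zone descriptor from flags. */
static struct zone *flags_to_zone(unsigned long long flags)
{
    return &zone_table[flags >> ZONE_SHIFT];
}
```

The low bits remain free for the actual page flags, which is exactly why the scheme works: the number of real flags is small enough to leave the high bits unused.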
When the kernel invokes a memory allocation function, it must
specify the zones that contain the requested page frames. The kernel
usually specifies which zones it's willing to use. For instance, if a
page frame must be directly mapped in the fourth gigabyte of linear
addresses but it is not going to be used for ISA DMA transfers, then
the kernel requests a page frame either in ZONE_NORMAL or in ZONE_DMA. Of course, the page frame should
be obtained from ZONE_DMA only if
ZONE_NORMAL does not have free page
frames. To specify the preferred zones in a memory allocation request,
the kernel uses the zonelist data
structure, which is an array of zone descriptor pointers.
Memory allocation requests can be satisfied in two different ways. If enough free memory is available, the request can be satisfied immediately. Otherwise, some memory reclaiming must take place, and the kernel control path that made the request is blocked until additional memory has been freed.
However, some kernel control paths cannot be blocked while
requesting memory—this happens, for instance, when handling an
interrupt or when executing code inside a critical region. In these
cases, a kernel control path should issue atomic memory
allocation requests (using the GFP_ATOMIC flag; see the later section
"The Zoned Page Frame
Allocator"). An atomic request never blocks: if there are not
enough free pages, the allocation simply fails.
Although there is no way to ensure that an atomic memory allocation request never fails, the kernel tries hard to minimize the likelihood of this unfortunate event. In order to do this, the kernel reserves a pool of page frames for atomic memory allocation requests to be used only on low-on-memory conditions.
The amount of the reserved memory (in kilobytes) is stored in
the min_free_kbytes variable. Its
initial value is set during kernel initialization and depends on the
amount of physical memory that is directly mapped in the kernel's
fourth gigabyte of linear addresses—that is, it depends on the number
of page frames included in the ZONE_DMA and ZONE_NORMAL memory zones:

    reserved pool size = floor(sqrt(16 * (directly mapped memory)))

(where both quantities are expressed in kilobytes).
However, the initial value of min_free_kbytes can be neither lower than 128 nor greater than 65,536.[*]
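The initialization can be sketched in code. The 128/65,536 clamp is stated in the text; the square-root seed is an assumption about how the value is derived from the amount of directly mapped memory (both quantities in kilobytes):

```c
/* Integer square root by counting; fast enough for the values involved. */
static unsigned long isqrt(unsigned long long x)
{
    unsigned long long r = 0;
    while ((r + 1) * (r + 1) <= x)
        r++;
    return (unsigned long)r;
}

/* Sketch of the min_free_kbytes initialization: seed from the directly
 * mapped memory, then clamp to the [128, 65536] range from the text. */
static unsigned long compute_min_free_kbytes(unsigned long lowmem_kbytes)
{
    unsigned long v = isqrt(16ULL * lowmem_kbytes);
    if (v < 128)        /* lower bound */
        v = 128;
    if (v > 65536)      /* upper bound */
        v = 65536;
    return v;
}
```

With 896 MB of directly mapped memory (917,504 KB), this sketch yields 3,831 KB reserved; the clamp only matters on machines with very little or very much low memory.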
The ZONE_DMA and ZONE_NORMAL memory zones contribute to the
reserved memory with a number of page frames proportional to their
relative sizes. For instance, if the ZONE_NORMAL zone is eight times bigger than
ZONE_DMA, seven-eighths of the page
frames will be taken from ZONE_NORMAL and one-eighth from ZONE_DMA.
The pages_min field of the
zone descriptor stores the number
of reserved page frames inside the zone. As we'll see in Chapter 17, this field plays also
a role for the page frame reclaiming algorithm, together with the
pages_low and pages_high fields. The pages_low field is always set to 5/4 of the
value of pages_min, and pages_high is always set to 3/2 of the value
of pages_min.
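The watermark ratios just stated translate directly into integer arithmetic on pages_min (a reduced sketch; the kernel stores all three values in the zone descriptor):

```c
/* pages_low = 5/4 * pages_min */
static unsigned long pages_low_of(unsigned long pages_min)
{
    return pages_min * 5 / 4;
}

/* pages_high = 3/2 * pages_min */
static unsigned long pages_high_of(unsigned long pages_min)
{
    return pages_min * 3 / 2;
}
```

For example, a zone with pages_min of 1,024 page frames has pages_low = 1,280 and pages_high = 1,536.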
The kernel subsystem that handles the memory allocation requests for groups of contiguous page frames is called the zoned page frame allocator . Its main components are shown in Figure 8-2.
The component named "zone allocator " receives the requests for allocation and deallocation of dynamic memory. In the case of allocation requests, the component searches a memory zone that includes a group of contiguous page frames that can satisfy the request (see the later section "The Zone Allocator"). Inside each zone, page frames are handled by a component named "buddy system " (see the later section "The Buddy System Algorithm"). To get better system performance, a small number of page frames are kept in cache to quickly satisfy the allocation requests for single page frames (see the later section "The Per-CPU Page Frame Cache").
Page frames can be requested by using six slightly
different functions and macros. Unless otherwise stated, they return
the linear address of the first allocated page or return NULL if the allocation failed.
alloc_pages(gfp_mask, order)
Macro used to request 2^order contiguous page frames. It returns the address of the descriptor of the first allocated page frame or returns NULL if the allocation failed.
alloc_page(gfp_mask)
Macro used to get a single page frame; it expands to:
    alloc_pages(gfp_mask, 0)
It returns the address of the descriptor of the allocated page frame or returns NULL if the allocation failed.
_ _get_free_pages(gfp_mask, order)
Function that is similar to alloc_pages( ), but it returns the linear address of the first allocated page.
_ _get_free_page(gfp_mask)
Macro used to get a single page frame; it expands to:
    _ _get_free_pages(gfp_mask, 0)
get_zeroed_page(gfp_mask)
Function used to obtain a page frame filled with zeros; it invokes:
    alloc_pages(gfp_mask | _ _GFP_ZERO, 0)
and returns the linear address of the obtained page frame.
_ _get_dma_pages(gfp_mask, order)
Macro used to get page frames suitable for DMA; it expands to:
    _ _get_free_pages(gfp_mask | _ _GFP_DMA, order)
The parameter gfp_mask is a
group of flags that specify how to look for free page frames. The
flags that can be used in gfp_mask are shown in Table 8-5.
Table 8-5. Flags used to request page frames
| Flag | Description |
|---|---|
| _ _GFP_DMA | The page frame must belong to the ZONE_DMA memory zone. |
| _ _GFP_HIGHMEM | The page frame may belong to the ZONE_HIGHMEM memory zone. |
| _ _GFP_WAIT | The kernel is allowed to block the current process waiting for free page frames. |
| _ _GFP_HIGH | The kernel is allowed to access the pool of reserved page frames. |
| _ _GFP_IO | The kernel is allowed to perform I/O transfers on low memory pages in order to free page frames. |
| _ _GFP_FS | If clear, the kernel is not allowed to perform filesystem-dependent operations. |
| _ _GFP_COLD | The requested page frames may be "cold" (see the later section "The Per-CPU Page Frame Cache"). |
| _ _GFP_NOWARN | A memory allocation failure will not produce a warning message. |
| _ _GFP_REPEAT | The kernel keeps retrying the memory allocation until it succeeds. |
| _ _GFP_NOFAIL | Same as _ _GFP_REPEAT. |
| _ _GFP_NORETRY | Do not retry a failed memory allocation. |
| _ _GFP_NO_GROW | The slab allocator does not allow a slab cache to be enlarged (see the later section "The Slab Allocator"). |
| _ _GFP_COMP | The page frame belongs to an extended page (see the section "Extended Paging" in Chapter 2). |
| _ _GFP_ZERO | The page frame returned, if any, must be filled with zeros. |
In practice, Linux uses the predefined combinations of flag values shown in Table 8-6; the group name is what you'll encounter as the argument of the six page frame allocation functions.
Table 8-6. Groups of flag values used to request page frames
| Group name | Corresponding flags |
|---|---|
| GFP_ATOMIC | _ _GFP_HIGH |
| GFP_NOIO | _ _GFP_WAIT |
| GFP_NOFS | _ _GFP_WAIT \| _ _GFP_IO |
| GFP_KERNEL | _ _GFP_WAIT \| _ _GFP_IO \| _ _GFP_FS |
| GFP_USER | _ _GFP_WAIT \| _ _GFP_IO \| _ _GFP_FS |
| GFP_HIGHUSER | _ _GFP_WAIT \| _ _GFP_IO \| _ _GFP_FS \| _ _GFP_HIGHMEM |
The _ _GFP_DMA and _ _GFP_HIGHMEM flags are called
zone modifiers ; they specify the zones searched by the kernel while
looking for free page frames. The node_zonelists field of the contig_page_data node descriptor is an
array of lists of zone descriptors representing the
fallback zones: for each setting of the zone
modifiers, the corresponding list includes the memory zones that
could be used to satisfy the memory allocation request in case the
original zone is short on page frames. In the 80 × 86 UMA
architecture, the fallback zones are the following:
If the _ _GFP_DMA flag
is set, page frames can be taken only from the ZONE_DMA memory zone.
Otherwise, if the _
_GFP_HIGHMEM flag is not set,
page frames can be taken only from the ZONE_NORMAL and the ZONE_DMA memory zones, in order of
preference.
Otherwise (the _
_GFP_HIGHMEM flag is set), page frames can be taken
from ZONE_HIGHMEM, ZONE_NORMAL, and ZONE_DMA memory zones, in order of
preference.
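The three cases above can be written out as a small selection function. The flag values and helper names below are hypothetical; only the ordering rules come from the text:

```c
/* Illustrative zone modifier bits; not the kernel's actual values. */
#define MY_GFP_DMA     0x01u
#define MY_GFP_HIGHMEM 0x02u

enum { ZD, ZN, ZH };   /* stand-ins for ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM */

/* Fill out[] with zone indices in order of preference; return the count. */
static int build_fallback(unsigned int gfp_mask, int out[3])
{
    if (gfp_mask & MY_GFP_DMA) {       /* _ _GFP_DMA set: ZONE_DMA only */
        out[0] = ZD;
        return 1;
    }
    if (gfp_mask & MY_GFP_HIGHMEM) {   /* _ _GFP_HIGHMEM set */
        out[0] = ZH; out[1] = ZN; out[2] = ZD;
        return 3;
    }
    out[0] = ZN; out[1] = ZD;          /* neither modifier set */
    return 2;
}

/* Small helpers so the list can be inspected element by element. */
static int fallback_len(unsigned int m) { int o[3]; return build_fallback(m, o); }
static int fallback_at(unsigned int m, int i) { int o[3]; build_fallback(m, o); return o[i]; }
```

In the kernel the equivalent ordered lists are precomputed in the node descriptor's node_zonelists array rather than rebuilt on every allocation.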
Page frames can be released through each of the following four functions and macros:
_ _free_pages(page, order)
This function checks the page descriptor pointed to by page; if the page frame is not reserved (i.e., if the PG_reserved flag is equal to 0), it decreases the count field of the descriptor. If count becomes 0, it assumes that 2^order contiguous page frames starting from the one corresponding to page are no longer used. In this case, the function releases the page frames as explained in the later section "The Zone Allocator."
free_pages(addr, order)
This function is similar to _ _free_pages( ), but it receives as an argument the linear address addr of the first page frame to be released.
_ _free_page(page)
This macro releases the page frame having the descriptor pointed to by page; it expands to:
    _ _free_pages(page, 0)
free_page(addr)
This macro releases the page frame having the linear address addr; it expands to:
    free_pages(addr, 0)
The linear address that corresponds to the end of the
directly mapped physical memory, and thus to the beginning of the high
memory, is stored in the high_memory variable, which is set to 896
MB. Page frames above the 896 MB boundary are not generally mapped in
the fourth gigabyte of the kernel linear address spaces, so the kernel
is unable to directly access them. This implies that each page
allocator function that returns the linear address of the assigned
page frame doesn't work for high-memory page frames, that is, for page frames in the ZONE_HIGHMEM memory zone.
For instance, suppose that the kernel invoked _ _get_free_pages(GFP_HIGHMEM,0) to allocate
a page frame in high memory. If the allocator assigned a page frame in
high memory, _ _get_free_pages( )
cannot return its linear address because it doesn't exist; thus, the
function returns NULL. In turn, the
kernel cannot use the page frame; even worse, the page frame cannot be
released because the kernel has lost track of it.
This problem does not exist on 64-bit hardware platforms,
because the available linear address space is much larger than the
amount of RAM that can be installed—in short, the ZONE_HIGHMEM zone of these architectures is
always empty. On 32-bit platforms such as the 80 × 86 architecture,
however, Linux designers had to find some way to allow the kernel to
exploit all the available RAM, up to the 64 GB supported by PAE. The
approach adopted is the following:
The allocation of high-memory page frames is done only
through the alloc_pages( )
function and its alloc_page( )
shortcut. These functions do not return the linear address of the
first allocated page frame, because if the page frame belongs to
the high memory, such linear address simply does not exist.
Instead, the functions return the linear address of the page
descriptor of the first allocated page frame. These linear
addresses always exist, because all page descriptors are allocated
in low memory once and forever during the kernel
initialization.
Page frames in high memory that do not have a linear address cannot be accessed by the kernel. Therefore, part of the last 128 MB of the kernel linear address space is dedicated to mapping high-memory page frames. Of course, this kind of mapping is temporary, otherwise only 128 MB of high memory would be accessible. Instead, by recycling linear addresses the whole high memory can be accessed, although at different times.
The kernel uses three different mechanisms to map page frames in high memory; they are called permanent kernel mapping, temporary kernel mapping, and noncontiguous memory allocation. In this section, we'll cover the first two techniques; the third one is discussed in the section "Noncontiguous Memory Area Management" later in this chapter.
Establishing a permanent kernel mapping may block the current process; this happens when no free Page Table entries exist that can be used as "windows" on the page frames in high memory. Thus, a permanent kernel mapping cannot be established in interrupt handlers and deferrable functions. Conversely, establishing a temporary kernel mapping never requires blocking the current process; its drawback, however, is that very few temporary kernel mappings can be established at the same time.
A kernel control path that uses a temporary kernel mapping must ensure that no other kernel control path is using the same mapping. This implies that the kernel control path can never block, otherwise another kernel control path might use the same window to map some other high memory page.
Of course, none of these techniques allow addressing the whole RAM simultaneously. After all, less than 128 MB of linear address space are left for mapping the high memory, while PAE supports systems having up to 64 GB of RAM.
Permanent kernel mappings allow the kernel to
establish long-lasting mappings of high-memory page frames into the
kernel address space. They use a dedicated Page Table in the master
kernel page tables . The pkmap_page_table variable stores the
address of this Page Table, while the LAST_PKMAP macro yields the number of
entries. As usual, the Page Table includes either 512 or 1,024
entries, according to whether PAE is enabled or disabled (see the
section "The Physical
Address Extension (PAE) Paging Mechanism" in Chapter 2); thus, the kernel can
access at most 2 or 4 MB of high memory at once.
The Page Table maps the linear addresses starting from
PKMAP_BASE. The pkmap_count array includes LAST_PKMAP counters, one for each entry of
the pkmap_page_table Page Table.
We distinguish three cases:
The counter is 0
The corresponding Page Table entry does not map any high-memory page frame and is usable.

The counter is 1
The corresponding Page Table entry does not map any high-memory page frame, but it cannot be used because the corresponding TLB entry has not been flushed since its last usage.

The counter is n (greater than 1)
The corresponding Page Table entry maps a high-memory page frame, which is used by exactly n − 1 kernel components.
To keep track of the association between high memory page
frames and linear addresses induced by permanent kernel
mappings , the kernel makes use of the page_address_htable hash table. This table
contains one page_address_map
data structure for each page frame in high memory that is currently
mapped. In turn, this data structure contains a pointer to the page
descriptor and the linear address assigned to the page frame.
The page_address( )
function returns the linear address associated with the page frame,
or NULL if the page frame is in
high memory and is not mapped. This function, which receives as its
parameter a page descriptor pointer page, distinguishes two cases:
If the page frame is not in high memory (PG_highmem flag clear), the linear
address always exists and is obtained by computing the page
frame index, converting it into a physical address, and finally
deriving the linear address corresponding to the physical
address. This is accomplished by the following code:
_ _va((unsigned long)(page - mem_map) << 12)
If the page frame is in high memory (PG_highmem flag set), the function
looks into the page_address_htable hash table. If the
page frame is found in the hash table, page_address( ) returns its linear
address, otherwise it returns NULL.
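The low-memory branch above can be modeled in user space. The page frame index (page - mem_map) shifted left by 12 is the physical address, and the _ _va( ) step adds PAGE_OFFSET (0xC0000000 on the 80 × 86) to obtain the linear address; mem_map is modeled here by a small local array:

```c
#define MY_PAGE_OFFSET 0xC0000000UL

struct my_page { int flags; };
static struct my_page my_mem_map[1024];   /* stand-in for mem_map */

/* Model of the low-memory case of page_address( ). */
static unsigned long lowmem_page_address(struct my_page *page)
{
    unsigned long pfn = (unsigned long)(page - my_mem_map); /* frame index */
    return MY_PAGE_OFFSET + (pfn << 12);                    /* _ _va(pfn << 12) */
}
```

Frame 3, for example, maps to linear address 0xC0003000: pointer subtraction yields the index, not a byte offset, which is why no division by sizeof(struct page) appears.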
The kmap( ) function
establishes a permanent kernel mapping. It is essentially equivalent
to the following code:
void * kmap(struct page * page)
{
if (!PageHighMem(page))
return page_address(page);
return kmap_high(page);
}
The kmap_high( ) function
is invoked if the page frame really belongs to high memory. The
function is essentially equivalent to the following code:
void * kmap_high(struct page * page)
{
unsigned long vaddr;
spin_lock(&kmap_lock);
vaddr = (unsigned long) page_address(page);
if (!vaddr)
vaddr = map_new_virtual(page);
pkmap_count[(vaddr-PKMAP_BASE) >> PAGE_SHIFT]++;
spin_unlock(&kmap_lock);
return (void *) vaddr;
}
The function gets the kmap_lock spin lock to protect the Page
Table against concurrent accesses in multiprocessor systems. Notice
that there is no need to disable the interrupts, because kmap( ) cannot be invoked by interrupt
handlers and deferrable functions. Next, the kmap_high( ) function checks whether the
page frame is already mapped by invoking page_address( ). If not, the function
invokes map_new_virtual( ) to
insert the page frame physical address into an entry of pkmap_page_table and to add an element to
the page_address_htable hash
table. Then kmap_high( )
increases the counter corresponding to the linear address of the
page frame to take into account the new kernel component that
invoked this function. Finally, kmap_high(
) releases the kmap_lock spin lock and returns the linear
address that maps the page frame.
The map_new_virtual( )
function essentially executes two nested loops:
for (;;) {
int count;
DECLARE_WAITQUEUE(wait, current);
for (count = LAST_PKMAP; count > 0; --count) {
last_pkmap_nr = (last_pkmap_nr + 1) & (LAST_PKMAP - 1);
if (!last_pkmap_nr) {
flush_all_zero_pkmaps( );
count = LAST_PKMAP;
}
if (!pkmap_count[last_pkmap_nr]) {
unsigned long vaddr = PKMAP_BASE +
(last_pkmap_nr << PAGE_SHIFT);
set_pte(&(pkmap_page_table[last_pkmap_nr]),
mk_pte(page, _ _pgprot(0x63)));
pkmap_count[last_pkmap_nr] = 1;
set_page_address(page, (void *) vaddr);
return vaddr;
}
}
current->state = TASK_UNINTERRUPTIBLE;
add_wait_queue(&pkmap_map_wait, &wait);
spin_unlock(&kmap_lock);
schedule( );
remove_wait_queue(&pkmap_map_wait, &wait);
spin_lock(&kmap_lock);
if (page_address(page))
return (unsigned long) page_address(page);
}
In the inner loop, the function scans all counters in pkmap_count until it finds a null value.
The large if block runs when an
unused entry is found in pkmap_count. That block determines the
linear address corresponding to the entry, creates an entry for it
in the pkmap_page_table Page
Table, sets the count to 1 because the entry is now used, invokes
set_page_address( ) to insert a
new element in the page_address_htable hash table, and
returns the linear address.
The function starts where it left off last time, cycling
through the pkmap_count array. It
does this by preserving in a variable named last_pkmap_nr the index of the last used
entry in the pkmap_page_table
Page Table. Thus, the search starts from where it was left in the
last invocation of the map_new_virtual(
) function.
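The resumable circular scan relies on one idiom visible in the code above: advancing last_pkmap_nr modulo LAST_PKMAP with a bit mask, which is valid because LAST_PKMAP is a power of two (512 or 1,024):

```c
#define MY_LAST_PKMAP 512   /* a power of two, as in the kernel */

/* One step of the circular scan over the pkmap_count array. */
static int next_pkmap_nr(int last_pkmap_nr)
{
    return (last_pkmap_nr + 1) & (MY_LAST_PKMAP - 1);
}
```

The mask (LAST_PKMAP - 1) has all low bits set, so the AND is equivalent to a modulo operation and wraps 511 back to 0 without a conditional branch.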
When the last counter in pkmap_count is reached, the search
restarts from the counter at index 0. Before continuing, however,
map_new_virtual( ) invokes the
flush_all_zero_pkmaps( )
function, which starts another scan of the counters, looking for
those that have the value 1. Each counter that has a value of 1
denotes an entry in pkmap_page_table that is free but cannot
be used because the corresponding TLB entry has not yet been
flushed. flush_all_zero_pkmaps( )
resets their counters to zero, deletes the corresponding elements
from the page_address_htable hash
table, and issues TLB flushes on all entries of pkmap_page_table.
If the inner loop cannot find a null counter in pkmap_count, the map_new_virtual( ) function blocks the
current process until some other process releases an entry of the
pkmap_page_table Page Table. This
is achieved by inserting current
in the pkmap_map_wait wait queue,
setting the current state to
TASK_UNINTERRUPTIBLE, and
invoking schedule( ) to
relinquish the CPU. Once the process is awakened, the function
checks whether another process has mapped the page by invoking
page_address( ); if no other
process has mapped the page yet, the inner loop is restarted.
The kunmap( ) function
destroys a permanent kernel mapping established previously by
kmap( ). If the page is really in
the high memory zone, it invokes the kunmap_high( ) function, which is
essentially equivalent to the following code:
void kunmap_high(struct page * page)
{
spin_lock(&kmap_lock);
if ((--pkmap_count[((unsigned long)page_address(page)
-PKMAP_BASE)>>PAGE_SHIFT]) == 1)
if (waitqueue_active(&pkmap_map_wait))
wake_up(&pkmap_map_wait);
spin_unlock(&kmap_lock);
}
The expression within the brackets computes the index into the
pkmap_count array from the page's
linear address. The counter is decreased and compared to 1. A
successful comparison indicates that no process is using the page.
The function can finally wake up processes in the wait queue filled
by map_new_virtual( ), if
any.
Temporary kernel mappings are simpler to implement than permanent kernel mappings; moreover, they can be used inside interrupt handlers and deferrable functions, because requesting a temporary kernel mapping never blocks the current process.
Every page frame in high memory can be mapped through a window in the kernel address space—namely, a Page Table entry that is reserved for this purpose. The number of windows reserved for temporary kernel mappings is quite small.
Each CPU has its own set of 13 windows, represented by the
enum km_type data structure. Each
symbol defined in this data structure—such as KM_BOUNCE_READ, KM_USER0, or KM_PTE0—identifies the linear address of a
window.
The kernel must ensure that the same window is never used by
two kernel control paths at the same time. Thus, each symbol in the
km_type structure is dedicated to
one kernel component and is named after the component. The last
symbol, KM_TYPE_NR, does not
represent a linear address by itself, but yields the number of
different windows usable by every CPU.
Each symbol in km_type,
except the last one, is an index of a fix-mapped linear address (see
the section "Fix-Mapped
Linear Addresses" in Chapter 2). The enum fixed_addresses data structure
includes the symbols FIX_KMAP_BEGIN and FIX_KMAP_END; the latter is assigned to
the index FIX_KMAP_BEGIN + (KM_TYPE_NR *
NR_CPUS) - 1. In this manner, there are KM_TYPE_NR fix-mapped linear
addresses for each CPU in the system. Furthermore, the kernel
initializes the kmap_pte variable
with the address of the Page Table entry corresponding to the
fix_to_virt(FIX_KMAP_BEGIN)
linear address.
To establish a temporary kernel mapping, the kernel invokes
the kmap_atomic( ) function,
which is essentially equivalent to the following code:
void * kmap_atomic(struct page * page, enum km_type type)
{
enum fixed_addresses idx;
unsigned long vaddr;
current_thread_info( )->preempt_count++;
if (!PageHighMem(page))
return page_address(page);
idx = type + KM_TYPE_NR * smp_processor_id( );
vaddr = fix_to_virt(FIX_KMAP_BEGIN + idx);
set_pte(kmap_pte-idx, mk_pte(page, 0x063));
_ _flush_tlb_single(vaddr);
return (void *) vaddr;
}
The type argument and the
CPU identifier retrieved through smp_processor_id( ) specify what
fix-mapped linear address has to be used to map the request page.
The function returns the linear address of the page frame if it
doesn't belong to high memory; otherwise, it sets up the Page Table
entry corresponding to the fix-mapped linear address with the page's
physical address and the bits Present, Accessed, Read/Write, and Dirty. Finally, the function flushes the
proper TLB entry and returns the linear address.
To destroy a temporary kernel mapping, the kernel uses the
kunmap_atomic( ) function. In the
80 × 86 architecture, this function decreases the preempt_count of the current process;
thus, if the kernel control path was preemptable right before
requiring a temporary kernel mapping, it will be preemptable again
after it has destroyed the same mapping. Moreover, kunmap_atomic( ) checks whether the
TIF_NEED_RESCHED flag of current is set and, if so, invokes
schedule( ).
The kernel must establish a robust and efficient strategy for allocating groups of contiguous page frames. In doing so, it must deal with a well-known memory management problem called external fragmentation: frequent requests and releases of groups of contiguous page frames of different sizes may lead to a situation in which several small blocks of free page frames are "scattered" inside blocks of allocated page frames. As a result, it may become impossible to allocate a large block of contiguous page frames, even if there are enough free pages to satisfy the request.
There are essentially two ways to avoid external fragmentation:
Use the paging circuitry to map groups of noncontiguous free page frames into intervals of contiguous linear addresses.
Develop a suitable technique to keep track of the existing blocks of free contiguous page frames, avoiding as much as possible the need to split up a large free block to satisfy a request for a smaller one.
The second approach is preferred by the kernel for three good reasons:
In some cases, contiguous page frames are really necessary, because contiguous linear addresses are not sufficient to satisfy the request. A typical example is a memory request for buffers to be assigned to a DMA processor (see Chapter 13). Because most DMAs ignore the paging circuitry and access the address bus directly while transferring several disk sectors in a single I/O operation, the buffers requested must be located in contiguous page frames.
Even if contiguous page frame allocation is not strictly necessary, it offers the big advantage of leaving the kernel paging tables unchanged. What's wrong with modifying the Page Tables? As we know from Chapter 2, frequent Page Table modifications lead to higher average memory access times, because they make the CPU flush the contents of the translation lookaside buffers.
Large chunks of contiguous physical memory can be accessed by the kernel through 4 MB pages. This reduces the translation lookaside buffers misses, thus significantly speeding up the average memory access time (see the section "Translation Lookaside Buffers (TLB)" in Chapter 2).
The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 11 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames, respectively. The largest request of 1024 page frames corresponds to a chunk of 4 MB of contiguous RAM. The physical address of the first page frame of a block is a multiple of the group size—for example, the initial address of a 16-page-frame block is a multiple of 16 × 2^12 (2^12 = 4,096, which is the regular page size).
We'll show how the algorithm works through a simple example:
Assume there is a request for a group of 256 contiguous page frames (i.e., one megabyte). The algorithm checks first to see whether a free block in the 256-page-frame list exists. If there is no such block, the algorithm looks for the next larger block—a free block in the 512-page-frame list. If such a block exists, the kernel allocates 256 of the 512 page frames to satisfy the request and inserts the remaining 256 page frames into the list of free 256-page-frame blocks. If there is no free 512-page block, the kernel then looks for the next larger block (i.e., a free 1024-page-frame block). If such a block exists, it allocates 256 of the 1024 page frames to satisfy the request, inserts the first 512 of the remaining 768 page frames into the list of free 512-page-frame blocks, and inserts the last 256 page frames into the list of free 256-page-frame blocks. If the list of 1024-page-frame blocks is empty, the algorithm gives up and signals an error condition.
The reverse operation, releasing blocks of page frames, gives rise to the name of this algorithm. The kernel attempts to merge pairs of free buddy blocks of size b together into a single block of size 2b. Two blocks are considered buddies if:
Both blocks have the same size, say b.
They are located in contiguous physical addresses.
The physical address of the first page frame of the first block is a multiple of 2 × b × 2^12.
The algorithm is iterative; if it succeeds in merging released blocks, it doubles b and tries again so as to create even bigger blocks.
Linux 2.6 uses a different buddy system for each zone. Thus, in the 80 × 86 architecture, there are 3 buddy systems: the first handles the page frames suitable for ISA DMA, the second handles the "normal" page frames, and the third handles the high-memory page frames. Each buddy system relies on the following main data structures:
The mem_map array
introduced previously. Actually, each zone is concerned with a
subset of the mem_map
elements. The first element in the subset and its number of
elements are specified, respectively, by the zone_mem_map and size fields of the zone
descriptor.
An array consisting of eleven elements of type free_area, one element for each group
size. The array is stored in the free_area field of the zone
descriptor.
Let us consider the k-th element of the free_area array in the zone descriptor, which identifies all the free blocks of size 2^k. The free_list field of this element is the head of a doubly linked circular list that collects the page descriptors associated with the free blocks of 2^k pages. More precisely, this list includes the page descriptors of the starting page frame of every block of 2^k free page frames; the pointers to the adjacent elements in the list are stored in the lru field of the page descriptor.[*]
Besides the head of the list, the k-th element of the free_area array also includes the field nr_free, which specifies the number of free blocks of size 2^k pages. Of course, if there are no blocks of 2^k free page frames, nr_free is equal to 0 and the free_list list is empty (both pointers of free_list point to the free_list field itself).
Finally, the private field of the descriptor of the first page in a block of 2^k free pages stores the order of the block, that is, the number k. Thanks to this field, when a block of pages is freed, the kernel can determine whether the buddy of the block is also free and, if so, it can coalesce the two blocks in a single block of 2^(k+1) pages. It should be noted that up to Linux 2.6.10, the kernel used 10 arrays of flags to encode this information.
The _ _rmqueue( )
function is used to find a free block in a zone. The function takes
two arguments: the address of the zone descriptor, and order, which denotes the logarithm of the
size of the requested block of free pages (0 for a one-page block, 1
for a two-page block, and so forth). If the page frames are
successfully allocated, the _ _rmqueue(
) function returns the address of the page descriptor of
the first allocated page frame. Otherwise, the function returns
NULL.
The _ _rmqueue( ) function
assumes that the caller has already disabled local interrupts and
acquired the zone->lock spin
lock, which protects the data structures of the buddy system. It
performs a cyclic search through each list for an available block
(denoted by an entry that doesn't point to the entry itself),
starting with the list for the requested order and continuing if necessary to
larger orders:
struct free_area *area;
unsigned int current_order;
for (current_order=order; current_order<11; ++current_order) {
area = zone->free_area + current_order;
if (!list_empty(&area->free_list))
goto block_found;
}
return NULL;
If the loop terminates, no suitable free block has been found,
so _ _rmqueue( ) returns a
NULL value. Otherwise, a suitable
free block has been found; in this case, the descriptor of its first
page frame is removed from the list and the value of free_ pages in the zone descriptor is
decreased:
block_found:
page = list_entry(area->free_list.next, struct page, lru);
list_del(&page->lru);
ClearPagePrivate(page);
page->private = 0;
area->nr_free--;
zone->free_pages -= 1UL << order;
If the block found comes from a list of size curr_order greater than the requested size order, a while cycle is executed. The rationale behind these lines of code is as follows: when it becomes necessary to use a block of 2^k page frames to satisfy a request for 2^h page frames (h < k), the program allocates the first 2^h page frames and iteratively reassigns the last 2^k - 2^h page frames to the free_area lists that have indexes between h and k:
size = 1 << curr_order;
while (curr_order > order) {
area--;
curr_order--;
size >>= 1;
buddy = page + size;
/* insert buddy as first element in the list */
list_add(&buddy->lru, &area->free_list);
area->nr_free++;
buddy->private = curr_order;
SetPagePrivate(buddy);
}
return page;
Because the _ _rmqueue( )
function has found a suitable free block, it returns the address
page of the page descriptor
associated with the first allocated page frame.
The _ _free_pages_bulk(
) function implements the buddy system strategy for
freeing page frames. It uses three basic input parameters:[*]
page
The address of the descriptor of the first page frame included in the block to be released
zone
The address of the zone descriptor
order
The logarithmic size of the block
The function assumes that the caller has already disabled
local interrupts and acquired the zone->lock spin lock, which protects
the data structure of the buddy system. _
_free_pages_bulk( ) starts by declaring and initializing a
few local variables:
struct page * base = zone->zone_mem_map; unsigned long buddy_idx, page_idx = page - base; struct page * buddy, * coalesced; int order_size = 1 << order;
The page_idx local variable
contains the index of the first page frame in the block with respect
to the first page frame of the zone.
The order_size local
variable is used to increase the counter of free page frames in the
zone:
zone->free_pages += order_size;
The function now performs a cycle executed at most 10 - order times, once for each possibility for merging a block with its buddy. The function starts with the smallest-sized block and moves up to the top size:
while (order < 10) {
buddy_idx = page_idx ^ (1 << order);
buddy = base + buddy_idx;
if (!page_is_buddy(buddy, order))
break;
list_del(&buddy->lru);
zone->free_area[order].nr_free--;
ClearPagePrivate(buddy);
buddy->private = 0;
page_idx &= buddy_idx;
order++;
}
In the body of the loop, the function looks for the index
buddy_idx of the block, which is
buddy to the one having the page descriptor index page_idx. It turns out that this index can
be easily computed as:
buddy_idx = page_idx ^ (1 << order);
In fact, an Exclusive OR (XOR) using the (1<<order) mask switches the value of the order-th bit of page_idx. Therefore, if the bit was previously zero, buddy_idx is equal to page_idx + order_size; conversely, if the bit was previously one, buddy_idx is equal to page_idx - order_size.
Once the buddy block index is known, the page descriptor of the buddy block can be easily obtained as:
buddy = base + buddy_idx;
Now the function invokes page_is_buddy() to check if buddy describes the first page of a block
of order_size free page
frames.
int page_is_buddy(struct page *page, int order)
{
if (PagePrivate(buddy) && page->private == order &&
!PageReserved(buddy) && page_count(page) ==0)
return 1;
return 0;
}
As you see, the buddy's first page must be free ( _count field equal to -1), it must belong to the dynamic memory
(PG_reserved bit clear), its
private field must be meaningful
(PG_private bit set), and finally
the private field must store the
order of the block being freed.
If all these conditions are met, the buddy block is free and
the function removes the buddy block from the list of free blocks of
order order, and performs one
more iteration looking for buddy blocks twice as big.
If at least one of the conditions in page_is_buddy( ) is not met, the function
breaks out of the cycle, because the free block obtained cannot be
merged further with other free blocks. The function inserts it in
the proper list and updates the private field of the first page frame with
the order of the block size:
coalesced = base + page_idx; coalesced->private = order; SetPagePrivate(coalesced); list_add(&coalesced->lru, &zone->free_area[order].free_list); zone->free_area[order].nr_free++;
As we will see later in this chapter, the kernel often requests and releases single page frames. To boost system performance, each memory zone defines a per-CPU page frame cache. Each per-CPU cache includes some pre-allocated page frames to be used for single memory requests issued by the local CPU.
Actually, there are two caches for each memory zone and for each CPU: a hot cache, which stores page frames whose contents are likely to be included in the CPU's hardware cache, and a cold cache.
Taking a page frame from the hot cache is beneficial for system performance if either the kernel or a User Mode process will write into the page frame right after the allocation. In fact, every access to a memory cell of the page frame will result in a line of the hardware cache being "stolen" from another page frame—unless, of course, the hardware cache already includes a line that maps the cell of the "hot" page frame just accessed.
Conversely, taking a page frame from the cold cache is convenient if the page frame is going to be filled with a DMA operation. In this case, the CPU is not involved and no line of the hardware cache will be modified. Taking the page frame from the cold cache preserves the reserve of hot page frames for the other kinds of memory allocation requests.
The main data structure implementing the per-CPU page frame
cache is an array of per_cpu_pageset data structures stored in
the pageset field of the memory
zone descriptor. The array includes one element for each CPU; this
element, in turn, consists of two per_cpu_pages descriptors, one for the hot
cache and the other for the cold cache. The fields of the per_cpu_pages descriptor are listed in Table 8-7.
Table 8-7. The fields of the per_cpu_pages descriptor
| Type | Name | Description |
|---|---|---|
| int | count | Number of page frames in the cache |
| int | low | Low watermark for cache replenishing |
| int | high | High watermark for cache depletion |
| int | batch | Number of page frames to be added to or subtracted from the cache |
| struct list_head | list | List of descriptors of the page frames included in the cache |
The kernel monitors the size of both the hot and cold caches by using two watermarks: if the number of page frames falls below the low watermark, the kernel replenishes the proper cache by allocating batch single page frames from the buddy system; otherwise, if the number of page frames rises above the high watermark, the kernel releases to the buddy system batch page frames in the cache. The values of batch, low, and high essentially depend on the number of page frames included in the memory zone.
The buffered_rmqueue(
) function allocates page frames in a given memory zone.
It makes use of the per-CPU page frame caches to handle single page
frame requests.
The parameters are the address of the memory zone descriptor,
the order of the memory allocation request order, and the allocation flags gfp_flags. If the _ _GFP_COLD flag is set in gfp_flags, the page frame should be taken
from the cold cache, otherwise it should be taken from the hot cache
(this flag is meaningful only for single page frame requests). The
function essentially executes the following operations:
If order is not equal
to 0, the per-CPU page frame cache cannot be used: the function
jumps to step 4.
Checks whether the memory zone's local per-CPU cache
identified by the value of the _
_GFP_COLD flag has to be replenished (the count field of the per_cpu_pages descriptor is lower than
or equal to the low field).
In this case, it executes the following substeps:
Allocates batch
single page frames from the buddy system by repeatedly
invoking the _ _rmqueue(
) function.
Inserts the descriptors of the allocated page frames in the cache's list.
Updates the value of count by adding the number of page
frames actually allocated.
If count is positive,
the function gets a page frame from the cache's list, decreases
count, and jumps to step 5.
(Observe that a per-CPU page frame cache could be empty; this
happens when the _ _rmqueue(
) function invoked in step 2a fails to allocate any
page frames.)
Here, the memory request has not yet been satisfied,
either because the request spans several contiguous page frames,
or because the selected page frame cache is empty. Invokes the
_ _rmqueue( ) function to
allocate the requested page frames from the buddy system.
If the memory request has been satisfied, the function initializes the page descriptor of the (first) page frame: clears some flags, sets the private field to zero, and sets the page frame reference counter to one. Moreover, if the _ _GFP_ZERO flag in gfp_flags is set, it fills the allocated memory area with zeros.
Returns the page descriptor address of the (first) page
frame, or NULL if the memory
allocation request failed.
In order to release a single page frame to a per-CPU page
frame cache, the kernel makes use of the free_hot_page( ) and free_cold_page( ) functions. Both of them
are simple wrappers for the free_hot_cold_page( ) function, which
receives as its parameters the descriptor address page of the page frame to be released and
a cold flag specifying either the
hot cache or the cold cache.
The free_hot_cold_page( )
function executes the following operations:
Gets from the page->flags field the address of
the memory zone descriptor including the page frame (see the
earlier section "Non-Uniform Memory Access
(NUMA)").
Gets the address of the per_cpu_pages descriptor of the zone's
cache selected by the cold
flag.
Checks whether the cache should be depleted: if count is higher than or equal to high, invokes the free_pages_bulk( ) function, passing to it the zone descriptor, the number of page frames to be released (batch field), the address of the cache's list, and the number zero (for 0-order page frames). In turn, the latter function invokes repeatedly the _ _free_pages_bulk( ) function to release the specified number of page frames—taken from the cache's list—to the buddy system of the memory zone.
Adds the page frame to be released to the cache's list,
and increases the count
field.
It should be noted that in the current version of the Linux
2.6 kernel, no page frame is ever released to the cold cache: the
kernel always assumes the freed page frame is hot with respect to
the hardware cache. Of course, this does not mean that the cold
cache is empty: the cache is replenished by buffered_rmqueue( ) when the low watermark
has been reached.
区域分配器 是内核页框分配器的前端。该组件必须找到一个内存区域,其中包含许多足够大的空闲页框以满足内存请求。这项任务并不像乍一看那么简单,因为区域分配器必须满足几个目标:
The zone allocator is the frontend of the kernel page frame allocator. This component must locate a memory zone that includes a number of free page frames large enough to satisfy the memory request. This task is not as simple as it might appear at first glance, because the zone allocator must satisfy several goals:
它应该保护保留页框池(请参阅前面的“保留页框池”部分)。
It should protect the pool of reserved page frames (see the earlier section "The Pool of Reserved Page Frames").
当内存不足并且允许阻塞当前进程时,应该触发页框回收算法(参见 第17章);一旦释放了一些页框,区域分配器将重试分配。
It should trigger the page frame reclaiming algorithm (see Chapter 17) when memory is scarce and blocking the current process is allowed; once some page frames have been freed, the zone allocator will retry the allocation.
如果可能的话,它应该保留小而宝贵的 ZONE_DMA 内存区域。例如,如果请求针对的是 ZONE_NORMAL 或 ZONE_HIGHMEM 页框,则区域分配器应该尽量避免在 ZONE_DMA 内存区域中分配页框。
It should preserve the small, precious ZONE_DMA memory zone, if possible. For
instance, the zone allocator should be somewhat reluctant to
assign page frames in the ZONE_DMA memory zone if the request was
for ZONE_NORMAL or ZONE_HIGHMEM page frames.
我们在前面的“分区页框分配器”部分中已经看到,对一组连续页框的每个请求最终都是通过执行alloc_pages宏来处理的。该宏最终会调用该_ _alloc_pages( )
函数,该函数是区域分配器的核心。它接收三个参数:
We have seen in the earlier section "The Zoned Page Frame
Allocator" that every request for a group of contiguous page
frames is eventually handled by executing the alloc_pages macro. This macro, in turn, ends
up invoking the _ _alloc_pages( )
function, which is the core of the zone allocator. It receives three
parameters:
gfp_mask
内存分配请求中指定的标志(参见前面的表 8-5)
The flags specified in the memory allocation request (see earlier Table 8-5)
order
要分配的连续页框组的对数大小
The logarithmic size of the group of contiguous page frames to be allocated
zonelist
指向 zonelist 数据结构的指针,按优先顺序描述适合内存分配的内存区域
Pointer to a zonelist
data structure describing, in order of preference, the memory
zones suitable for the memory allocation
该_ _alloc_pages( )
函数扫描zonelist数据结构中包含的每个内存区域。执行此操作的代码如下所示:
The _ _alloc_pages( )
function scans every memory zone included in the zonelist data structure. The code that does
this looks like the following:
for (i = 0; (z=zonelist->zones[i]) != NULL; i++) {
if (zone_watermark_ok(z, order, ...)) {
page = buffered_rmqueue(z, order, gfp_mask);
if (page)
return page;
}
}

对于每个内存区域,该函数将空闲页帧的数量与一个阈值进行比较,该阈值取决于内存分配标志、当前进程的类型以及该函数已检查该区域的次数。事实上,如果可用内存稀缺,则通常会扫描每个内存区域多次,每次都对分配所需的最小可用内存量使用更低的阈值。因此,前面的代码块在 _ _alloc_pages( ) 函数体中被复制了几次(只有微小的变化)。buffered_rmqueue( ) 函数已在前面的"Per-CPU 页帧缓存"一节中描述过:它返回第一个分配的页帧的页描述符;如果内存区域不包含所请求大小的一组连续页帧,则返回 NULL。
For each memory zone, the function compares the number of free
page frames with a threshold value that depends on the memory
allocation flags, on the type of current process, and on how many
times the zone has already been checked by the function. In fact, if
free memory is scarce, every memory zone is typically scanned several
times, each time with lower threshold on the minimal amount of free
memory required for the allocation. The previous block of code is thus
replicated several times—with minor variations—in the body of the
_ _alloc_pages( ) function. The
buffered_rmqueue( ) function has
been described already in the earlier section "The Per-CPU Page Frame
Cache": it returns the page descriptor of the first allocated
page frame, or NULL if the memory
zone does not include a group of contiguous page frames of the
requested size.
zone_watermark_ok( ) 辅助函数接收几个参数,这些参数确定内存区域中空闲页框数量的阈值 min。特别是,如果满足以下两个条件,该函数将返回值 1:
The zone_watermark_ok( )
auxiliary function receives several parameters, which determine a
threshold min on the number of free
page frames in the memory zone. In particular, the function returns
the value 1 if the following two conditions are met:
除了要分配的页框外,内存区域中至少还有 min 个空闲页框,不包括低内存保留中的页框(区域描述符的 lowmem_reserve 字段)。
Besides the page frames to be allocated, there are at least
min free page frames in the
memory zone, not including the page frames in the low-on-memory
reserve (lowmem_reserve field
of the zone descriptor).
除了要分配的页框外,对于 1 到分配阶数之间的每个 k,在阶数至少为 k 的块中至少还有 min/2^k 个空闲页框。因此,如果 order 大于 0,则在大小至少为 2 的块中必须至少有 min/2 个空闲页框;如果 order 大于 1,则在大小至少为 4 的块中必须至少有 min/4 个空闲页框;依此类推。
Besides the page frames to be allocated, there are at least
min/2^k free page frames in blocks of order at
least k, for each k
between 1 and the order of the allocation. Therefore, if order is greater than zero, there must
be at least min/2 free page
frames in blocks of size at least 2; if order is greater than one, there must be
at least min/4 free page frames
in blocks of size at least 4; and so on.
阈值 min 由 zone_watermark_ok( ) 确定如下:
The value of the threshold min is determined by zone_watermark_ok( ) as follows:
基值作为函数的参数传递,可以是区域的 pages_min、pages_low 和 pages_high 水印之一(请参阅本章前面的"保留页帧池"一节)。
The base value is passed as a parameter of the function and
can be one of the pages_min,
pages_low, and pages_high zone's watermarks (see the
section "The Pool of
Reserved Page Frames" earlier in this chapter).
如果设置了作为参数传递的 gfp_high 标志,则基值除以二。通常,如果在 gfp_mask 中设置了 _ _GFP_HIGHMEM 标志,即如果可以从高端内存分配页框,则该标志等于 1。
The base value is divided by two if the gfp_high flag passed as parameter is
set. Usually, this flag is equal to one if the _ _GFP_HIGHMEM flag is set in the
gfp_mask, that is, if the page
frames can be allocated from high memory.
如果设置了作为参数传递的 can_try_harder 标志,则阈值会进一步减小四分之一。如果在 gfp_mask 中设置了 _ _GFP_WAIT 标志,或者当前进程是实时进程并且内存分配是在进程上下文中(在中断处理程序和可延迟函数之外)完成的,则该标志通常等于 1。
The threshold value is further reduced by one-fourth if the
can_try_harder flag passed as
parameter is set. This flag is usually equal to one if either the
_ _GFP_WAIT flag is set in
gfp_mask, or if the current
process is a real-time process and the memory allocation is done
in process context (outside of interrupt handlers and deferrable
functions).
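Putting the two conditions and the threshold adjustments together, a simplified stand-alone version of the check might look like the following sketch. The discounting loop mirrors the kernel's approach of subtracting pages that sit in blocks too small for the request; the `nr_free[]` array and the exact signature are inventions of this example.

```c
#include <assert.h>

#define MAX_ORDER 11

/* Sketch of the zone_watermark_ok() logic described above; nr_free[o]
 * stands in for the zone's count of free blocks of order o. */
static int watermark_ok(int free_pages, int order,
                        const int nr_free[MAX_ORDER],
                        int base, int gfp_high, int can_try_harder)
{
    int min = base;

    if (gfp_high)
        min /= 2;          /* high-memory requests tolerate a smaller reserve */
    if (can_try_harder)
        min -= min / 4;    /* __GFP_WAIT or real-time callers */

    /* Condition 1: min free frames beyond the ones being allocated. */
    if (free_pages <= min + (1 << order) - 1)
        return 0;

    /* Condition 2: discount frames in blocks too small for the request
     * and halve the threshold at each order, which enforces "at least
     * min/2^k free frames in blocks of order at least k". */
    for (int o = 0; o < order; o++) {
        free_pages -= nr_free[o] << o;
        min /= 2;
        if (free_pages <= min)
            return 0;
    }
    return 1;
}
```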
该_ _alloc_pages( )
函数主要执行以下步骤:
The _ _alloc_pages( )
function essentially executes the following steps:
执行内存区域的第一次扫描(请参阅前面显示的代码块)。在第一次扫描中,min阈值设置为z->pages_low,其中z指向正在分析的区域描述符(can_try_harder
和gfp_high参数设置为零)。
Performs a first scanning of the memory zones (see the block
of code shown earlier). In this first scan, the min threshold value is set to z->pages_low, where z points to the zone descriptor being
analyzed (the can_try_harder
and gfp_high parameters are set
to zero).
如果函数在上一步中没有终止,则剩余的可用内存所剩无几:函数会唤醒 kswapd 内核线程开始异步回收页帧(参见第 17 章)。
If the function did not terminate in the previous step, there is not much free memory left: the function awakens the kswapd kernel threads to start reclaiming page frames asynchronously (see Chapter 17).
对内存区域执行第二次扫描,传递 z->pages_min 值作为基本阈值。如前所述,实际阈值还由 can_try_harder 和 gfp_high 标志确定。此步骤与步骤 1 几乎相同,只是该函数使用较低的阈值。
Performs a second scanning of the memory zones, passing as
base threshold the value z->pages_min. As explained
previously, the actual threshold is determined also by the
can_try_harder and gfp_high flags. This step is nearly
identical to step 1, except that the function is using a lower
threshold.
如果该函数没有在上一步中终止,则系统内存肯定不足。如果发出内存分配请求的内核控制路径不是中断处理程序或可延迟函数,并且它正在尝试回收页帧(current 的 PF_MEMALLOC 标志或 PF_MEMDIE 标志已设置),则该函数将对内存区域执行第三次扫描,尝试在忽略内存不足阈值的情况下分配页框,即不调用 zone_watermark_ok( )。这是允许内核控制路径耗尽由区域描述符的 lowmem_reserve 字段指定的低内存页保留的唯一情况。事实上,在这种情况下,发出内存请求的内核控制路径最终是在尝试释放页帧,因此如果可能的话,它应该得到它所请求的内容。如果没有内存区域包含足够的页框,则该函数返回 NULL 以通知调用者失败。
If the function did not terminate in the previous step, the
system is definitely low on memory. If the kernel control path
that issued the memory allocation request is not an interrupt
handler or a deferrable function and it is trying to reclaim page
frames (either the PF_MEMALLOC
flag or the PF_MEMDIE flag of
current is set), the function
then performs a third scanning of the memory zones, trying to
allocate the page frames ignoring the low-on-memory
thresholds—that is, without invoking zone_watermark_ok( ). This is the only
case where the kernel control path is allowed to deplete the
low-on-memory reserve of pages specified by the lowmem_reserve field of the zone
descriptor. In fact, in this case the kernel control path that
issued the memory request is ultimately trying to free page
frames, thus it should get what it has requested, if at all
possible. If no memory zone includes enough page frames, the
function returns NULL to notify
the caller of the failure.
这里,发出调用的内核控制路径并没有在尝试回收内存。如果 gfp_mask 中未设置 _ _GFP_WAIT 标志,则该函数返回 NULL 以通知内核控制路径内存分配失败:在这种情况下,无法在不阻塞当前进程的情况下满足请求。
Here, the invoking kernel control path is not trying to
reclaim memory. If the _
_GFP_WAIT flag of gfp_mask is not set, the function
returns NULL to notify the
kernel control path of the memory allocation failure: in this
case, there is no way to satisfy the request without blocking the
current process.
这里当前进程可以被阻塞:调用cond_resched()来检查其他进程是否需要CPU。
Here the current process can be blocked: invokes cond_resched() to check whether some
other process needs the CPU.
设置 current 的 PF_MEMALLOC 标志,表示进程已准备好执行内存回收。
Sets the PF_MEMALLOC flag
of current, to denote the fact
that the process is ready to perform memory reclaiming.
在 current->reclaim_state 中存储一个指向 reclaim_state 结构的指针。该结构仅包含一个字段 reclaimed_slab,初始化为零(我们将在本章后面的"将 Slab 分配器与分区页框分配器连接"一节中了解如何使用该字段)。
Stores in current->reclaim_state a pointer to a
reclaim_state structure. This
structure includes just one field, reclaimed_slab, initialized to zero
(we'll see how this field is used in the section "Interfacing the Slab
Allocator with the Zoned Page Frame Allocator" later in
this chapter).
调用 try_to_free_pages( ) 查找一些要回收的页框(请参阅第 17 章中的"低内存回收"一节)。后一个函数可能会阻塞当前进程。一旦该函数返回,_ _alloc_pages( ) 就会重置 current 的 PF_MEMALLOC 标志,并再次调用 cond_resched()。
Invokes try_to_free_pages(
) to look for some page frames to be reclaimed (see the
section "Low On
Memory Reclaiming" in Chapter 17). The latter
function may block the current process. Once that function
returns, _ _alloc_pages( )
resets the PF_MEMALLOC flag of
current and invokes once more
cond_resched().
如果上一步已释放了一些页框,则该函数将对内存区域再执行一次与步骤 3 相同的扫描。如果无法满足内存分配请求,则该函数将确定是否应继续扫描内存区域:如果 _ _GFP_NORETRY 标志已清除,并且内存分配请求最多跨越八个页帧,或者设置了 _ _GFP_REPEAT 和 _ _GFP_NOFAIL 标志之一,则该函数调用 blk_congestion_wait( ) 使进程休眠一段时间(请参阅第 14 章),然后跳回步骤 6。否则,该函数返回 NULL 以通知调用者内存分配失败。
If the previous step has freed some page frames, the
function performs yet another scanning of the memory zones equal
to the one performed in step 3. If the memory allocation request
cannot be satisfied, the function determines whether it should
continue scanning the memory zone: if the _ _GFP_NORETRY flag is clear and either
the memory allocation request spans up to eight page frames, or
one of the _ _GFP_REPEAT and
_ _GFP_NOFAIL flags is set, the
function invokes blk_congestion_wait(
) to put the process asleep for awhile (see Chapter 14), and it jumps
back to step 6. Otherwise, the function returns NULL to notify the caller that the
memory allocation failed.
如果在步骤 9 中没有释放任何页框,则内核将陷入严重的麻烦,因为可用内存非常低,并且无法回收任何页框。也许现在是做出关键决定的时候了。如果允许内核控制路径执行杀死进程所需的与文件系统相关的操作(gfp_mask 中的 _ _GFP_FS 标志已设置),并且 _ _GFP_NORETRY 标志已清除,则执行以下子步骤:
使用等于 z->pages_high 的阈值再次扫描内存区域。
调用 out_of_memory() 通过杀死一个牺牲进程来开始释放一些内存(请参阅第 17 章中的"内存不足杀手"一节)。
跳回步骤 1。
由于步骤 11a 中使用的水印比先前扫描中使用的水印高得多,因此该步骤很可能会失败。实际上,只有当另一个内核控制路径已经在杀死一个进程以回收其内存时,步骤 11a 才会成功。因此,步骤 11a 避免了杀死两个无辜进程而不是一个。
If no page frame has been freed in step 9, the kernel is in
deep trouble, because free memory is dangerously low and it was
not possible to reclaim any page frame. Perhaps the time has come
to take a crucial decision. If the kernel control path is allowed
to perform the filesystem-dependent operations needed to kill a
process (the _ _GFP_FS flag in
gfp_mask is set) and the
_ _GFP_NORETRY flag is clear,
performs the following substeps:
Scans once again the memory zones with a threshold value
equal to z->pages_high.
Invokes out_of_memory() to start freeing
some memory by killing a victim process (see "The Out of Memory
Killer" in Chapter
17).
Jumps back to step 1.
Because the watermark used in step 11a is much higher than the watermarks used in the previous scannings, that step is likely to fail. Actually, step 11a succeeds only if another kernel control path is already killing a process to reclaim its memory. Thus, step 11a avoids that two innocent processes are killed instead of one.
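The overall control flow of these steps can be caricatured in C as follows. The pass names and numeric thresholds are invented for this sketch; zone scanning, the kswapd wakeup, reclaiming, and the OOM killer are reduced to stubs or comments.

```c
#include <assert.h>

/* Invented pass names standing in for the successive scans. */
enum pass { PASS_LOW, PASS_MIN, PASS_NO_WATERMARK };

/* Stand-in for the scanning loop: later passes accept less free memory. */
static int scan_zones(enum pass p, int free_pages)
{
    int threshold = (p == PASS_LOW) ? 8 : (p == PASS_MIN) ? 4 : 0;
    return free_pages > threshold;     /* nonzero: "allocation succeeded" */
}

static int alloc_pages_skeleton(int free_pages, int pf_memalloc)
{
    if (scan_zones(PASS_LOW, free_pages))           /* step 1 */
        return 1;
    /* step 2: would wake the kswapd kernel threads here */
    if (scan_zones(PASS_MIN, free_pages))           /* step 3 */
        return 1;
    if (pf_memalloc)                                /* step 4: the caller is
                                                       itself a reclaimer */
        return scan_zones(PASS_NO_WATERMARK, free_pages);
    /* steps 5-11: block, reclaim, retry, possibly invoke the OOM killer */
    return 0;
}
```

The point of the skeleton is the ordering: each later pass trades a lower watermark for a higher risk of depleting the reserves, and only a reclaiming path may ignore the watermarks entirely.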
区域分配器还负责释放页框;值得庆幸的是,释放内存比分配内存容易得多。
The zone allocator also takes care of releasing page frames; thankfully, releasing memory is a lot easier than allocating it.
所有释放页框的内核宏和函数(在前面的“分区页框分配器”部分中描述)都依赖于该_
_free_pages( )函数。它接收要释放的第一个页帧的页描述符的地址 ( page) 和要释放的连续页帧组的对数大小 ( order) 作为其参数。该函数执行以下步骤:
All kernel macros and functions that release page
frames—described in the earlier section "The Zoned Page Frame
Allocator"—rely on the _
_free_pages( ) function. It receives as its parameters the
address of the page descriptor of the first page frame to be
released (page), and the
logarithmic size of the group of contiguous page frames to be
released (order). The function
executes the following steps:
检查第一个页框是否确实属于动态内存(其PG_reserved标志被清除);如果不是,则终止。
Checks that the first page frame really belongs to dynamic
memory (its PG_reserved flag
is cleared); if not, terminates.
减少page->_count使用计数器;如果它仍然大于或等于零,则终止。
Decreases the page->_count usage counter; if it
is still greater than or equal to zero, terminates.
如果order等于 0,则该函数调用free_hot_page( )将页帧释放到适当内存区域的每 CPU 热缓存(请参阅前面的部分“每 CPU 页帧缓存”)。
If order is equal to
zero, the function invokes free_hot_page( ) to release the page
frame to the per-CPU hot cache of the proper memory zone (see
the earlier section "The Per-CPU Page Frame
Cache").
如果 order 大于零,则将这些页帧添加到本地列表中,并调用 free_pages_bulk( ) 函数将它们释放到适当内存区域的伙伴系统中(请参阅前面"每 CPU 页帧缓存"一节中 free_hot_cold_page( ) 描述的步骤 3)。
If order is greater
than zero, it adds the page frames in a local list and invokes
the free_pages_bulk( )
function to release them to the buddy system of the proper
memory zone (see step 3 in the description of free_hot_cold_page( ) in the earlier
section "The Per-CPU
Page Frame Cache").
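A toy rendering of this dispatch, with plain counters standing in for the hot cache and the buddy free lists (all `toy_`-prefixed names are invented for the example):

```c
#include <assert.h>

struct toy_zone { int hot_cached; int buddy_free; };

static void toy_free_hot_page(struct toy_zone *z)        { z->hot_cached++; }
static void toy_free_to_buddy(struct toy_zone *z, int n) { z->buddy_free += n; }

/* 'count' plays the role of page->_count: decrement it, and only free
 * the frames when no reference remains (the counter drops below zero). */
static void toy_free_pages(struct toy_zone *z, int count, int order)
{
    if (--count >= 0)
        return;                            /* still referenced elsewhere */
    if (order == 0)
        toy_free_hot_page(z);              /* single frame: per-CPU hot cache */
    else
        toy_free_to_buddy(z, 1 << order);  /* group: straight to the buddy system */
}
```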
[ * ]此外,Linux 内核甚至对一些物理地址空间中存在巨大“漏洞”的特殊单处理器系统也使用 NUMA。内核通过将有效物理地址的连续子范围分配给不同的内存节点来处理这些架构。
[*] Furthermore, the Linux kernel makes use of NUMA even for some peculiar uniprocessor systems that have huge "holes" in the physical address space. The kernel handles these architectures by assigning the contiguous subranges of valid physical addresses to different memory nodes .
[ * ]我们还有这种设计选择的另一个例子:即使硬件架构只定义了两个级别,Linux 也使用四个级别的页表(请参阅第 2 章中的“ Linux 中的分页” 部分)。
[*] We have another example of this kind of design choice: Linux uses four levels of Page Tables even when the hardware architecture defines just two levels (see the section "Paging in Linux" in Chapter 2).
[ * ]为索引保留的位数取决于内核是否支持 NUMA 模型以及 flags 字段的大小。如果不支持 NUMA,则 flags 字段具有两位用于区域索引和一位(始终设置为零)用于节点索引。在 NUMA 32 位架构上,flags 有两位用于区域索引,六位用于节点号。最后,在 NUMA 64 位架构上,64 位 flags 字段有 2 位用于区域索引,10 位用于节点编号。
[*] The number of bits reserved for the indices depends on
whether the kernel supports the NUMA model and on the size of the
flags field. If NUMA is not
supported, the flags field has
two bits for the zone index and one bit—always set to zero—for the
node index. On NUMA 32-bit architectures, flags has two bits for the zone index
and six bits for the node number. Finally, on NUMA 64-bit
architectures, the 64-bit flags
field has 2 bits for the zone index and 10 bits for the node
number.
[ * ]系统管理员可以稍后通过写入 /proc/sys/vm/min_free_kbytes 文件或发出合适的 sysctl( ) 系统调用来更改保留内存量。
[*] The amount of reserved memory can be changed later by the
system administrator either by writing in the
/proc/sys/vm/min_free_kbytes file or by
issuing a suitable sysctl( )
system call.
本节涉及内存区域 ——也就是说,具有连续物理地址和任意长度的存储单元序列。
This section deals with memory areas —that is, with sequences of memory cells having contiguous physical addresses and an arbitrary length.
伙伴系统算法采用页框作为基本内存区域。这对于处理相对较大的内存请求来说很好,但是我们如何处理小内存区域的请求,比如几十或几百字节呢?
The buddy system algorithm adopts the page frame as the basic memory area. This is fine for dealing with relatively large memory requests, but how are we going to deal with requests for small memory areas, say a few tens or hundreds of bytes?
显然,分配一个完整的页框来存储几个字节是相当浪费的。更好的方法是引入新的数据结构,这些数据结构描述如何在同一页帧内分配小内存区域。在此过程中,我们引入了一个称为内部碎片的新问题。这是由于内存请求的大小与为满足该请求而分配的内存区域的大小不匹配造成的。
Clearly, it would be quite wasteful to allocate a full page frame to store a few bytes. A better approach instead consists of introducing new data structures that describe how small memory areas are allocated within the same page frame. In doing so, we introduce a new problem called internal fragmentation. It is caused by a mismatch between the size of the memory request and the size of the memory area allocated to satisfy the request.
经典的解决方案(早期 Linux 版本采用)包括提供大小呈几何分布的内存区域;换句话说,大小取决于 2 的幂而不是要存储的数据的大小。这样,无论内存请求大小是多少,我们都可以保证内部碎片始终小于50%。按照这种方法,内核创建 13 个几何分布的空闲内存区域列表,其大小范围从 32 到 131, 072 字节。调用伙伴系统既可以获取存储新内存区域所需的附加页帧,也可以相反地释放不再包含内存区域的页帧。动态列表用于跟踪每个页帧中包含的空闲内存区域。
A classical solution (adopted by early Linux versions) consists of providing memory areas whose sizes are geometrically distributed; in other words, the size depends on a power of 2 rather than on the size of the data to be stored. In this way, no matter what the memory request size is, we can ensure that the internal fragmentation is always smaller than 50 percent. Following this approach, the kernel creates 13 geometrically distributed lists of free memory areas whose sizes range from 32 to 131, 072 bytes. The buddy system is invoked both to obtain additional page frames needed to store new memory areas and, conversely, to release page frames that no longer contain memory areas. A dynamic list is used to keep track of the free memory areas contained in each page frame.
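A tiny helper makes the rounding behavior concrete. The 32-byte minimum matches the list sizes mentioned above; the function name is invented, and for requests larger than half the granted size the wasted fraction stays below 50 percent.

```c
#include <assert.h>

/* Round a request up to the next size in a power-of-2 series starting
 * at 32 bytes, as in the early-Linux free lists described above. */
static unsigned long round_to_list_size(unsigned long req)
{
    unsigned long size = 32;       /* smallest list */
    while (size < req)
        size <<= 1;
    return size;
}
```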
在伙伴算法之上运行内存区域分配算法并不是特别高效。更好的算法源自 slab 分配器模式,该模式首次在 Sun Microsystems 的 Solaris 2.4 操作系统中采用。它基于以下前提:
Running a memory area allocation algorithm on top of the buddy algorithm is not particularly efficient. A better algorithm is derived from the slab allocator schema that was adopted for the first time in the Sun Microsystems Solaris 2.4 operating system. It is based on the following premises:
要存储的数据类型可能会影响内存区域的分配方式;例如,当向用户模式进程分配页框时,内核调用该get_zeroed_page( )函数,该函数用零填充该页。
平板分配器的概念扩展了这个想法,并将内存区域视为由一组数据结构和一些称为构造函数和 析构函数的函数或方法组成的对象。前者初始化内存区域,后者取消初始化内存区域。
为了避免重复初始化对象,slab分配器不会丢弃已经分配然后释放的对象,而是将它们保存在内存中。当请求新对象时,可以从内存中取出它,而无需重新初始化。
The type of data to be stored may affect how memory areas
are allocated; for instance, when allocating a page frame to a
User Mode process, the kernel invokes the get_zeroed_page( ) function, which fills
the page with zeros.
The concept of a slab allocator expands upon this idea and views the memory areas as objects consisting of both a set of data structures and a couple of functions or methods called the constructor and destructor. The former initializes the memory area while the latter deinitializes it.
To avoid initializing objects repeatedly, the slab allocator does not discard the objects that have been allocated and then released but instead saves them in memory. When a new object is then requested, it can be taken from memory without having to be reinitialized.
内核函数倾向于重复请求相同类型的内存区域。例如,每当内核创建一个新进程时,它都会为一些固定大小的表分配内存区域,例如进程描述符、打开的文件对象等(参见第3章)。当进程终止时,用于包含这些表的内存区域可以被重用。由于进程的创建和销毁非常频繁,如果没有slab分配器,内核会浪费时间重复分配和释放包含相同内存区域的页框;板分配器允许将它们保存在缓存中并快速重用。
The kernel functions tend to request memory areas of the same type repeatedly. For instance, whenever the kernel creates a new process, it allocates memory areas for some fixed size tables such as the process descriptor, the open file object, and so on (see Chapter 3). When a process terminates, the memory areas used to contain these tables can be reused. Because processes are created and destroyed quite frequently, without the slab allocator, the kernel wastes time allocating and deallocating the page frames containing the same memory areas repeatedly; the slab allocator allows them to be saved in a cache and reused quickly.
对内存区域的请求可以根据其频率进行分类。通过创建一组具有正确大小的专用对象,可以最有效地处理预计经常发生的特定大小的请求,从而避免内部碎片。同时,很少遇到的大小可以通过基于一系列几何分布大小的对象(例如早期Linux版本中使用的2的幂大小)的分配方案来处理,即使这种方法会导致内部碎片。
Requests for memory areas can be classified according to their frequency. Requests of a particular size that are expected to occur frequently can be handled most efficiently by creating a set of special-purpose objects that have the right size, thus avoiding internal fragmentation. Meanwhile, sizes that are rarely encountered can be handled through an allocation scheme based on objects in a series of geometrically distributed sizes (such as the power-of-2 sizes used in early Linux versions), even if this approach leads to internal fragmentation.
引入大小不呈几何分布的对象还有另一个微妙的好处:数据结构的初始地址不太容易集中在其值为 2 的幂的物理地址上。这反过来又会带来更好的性能。处理器硬件缓存。
There is another subtle bonus in introducing objects whose sizes are not geometrically distributed: the initial addresses of the data structures are less prone to be concentrated on physical addresses whose values are a power of 2. This, in turn, leads to better performance by the processor hardware cache.
硬件缓存性能为尽可能限制对伙伴系统分配器的调用提供了另一个原因。每次调用伙伴系统函数都会“弄脏”硬件缓存,从而增加平均内存访问时间。内核函数对硬件缓存的影响称为函数 占用空间;它被定义为函数终止时被覆盖的缓存百分比。显然,大的占用空间会导致内核函数之后执行的代码执行速度变慢,因为硬件缓存现在充满了无用的信息。
Hardware cache performance creates an additional reason for limiting calls to the buddy system allocator as much as possible. Every call to a buddy system function "dirties" the hardware cache, thus increasing the average memory access time. The impact of a kernel function on the hardware cache is called the function footprint; it is defined as the percentage of cache overwritten by the function when it terminates. Clearly, large footprints lead to a slower execution of the code executed right after the kernel function, because the hardware cache is by now filled with useless information.
slab 分配器将对象分组到缓存中。每个缓存都是相同类型对象的"存储"。例如,当打开一个文件时,存储相应"打开文件"对象所需的内存区域将从名为 filp("file pointer")的 slab 分配器缓存中获取。
The slab allocator groups objects into caches . Each cache is a "store" of objects of the same type. For instance, when a file is opened, the memory area needed to store the corresponding "open file" object is taken from a slab allocator cache named filp (for "file pointer").
包含缓存的主存区域被划分为多个 slab ; 每个slab 由一个或多个连续的页框组成,其中包含已分配的对象和空闲对象(见图8-3)。
The area of main memory that contains a cache is divided into slabs ; each slab consists of one or more contiguous page frames that contain both allocated and free objects (see Figure 8-3).
正如我们将在第 17 章中看到的,内核定期扫描缓存并释放与空板对应的页框。
As we'll see in Chapter 17, the kernel periodically scans the caches and releases the page frames corresponding to empty slabs.
每个缓存都由一个 kmem_cache_t 类型的结构(相当于 struct kmem_cache_s 类型)来描述,其字段如表 8-8 所示。我们从表中省略了几个用于收集统计信息和调试的字段。
Each cache is described by a structure of type kmem_cache_t (which is equivalent to the
type struct kmem_cache_s), whose
fields are listed in Table
8-8. We omitted from the table several fields used for
collecting statistical information and for debugging.
表 8-8。kmem_cache_t 描述符的字段
Table 8-8. The fields of the kmem_cache_t descriptor
类型 Type | 姓名 Name | 描述 Description |
|---|---|---|
| struct array_cache * [] | array | 每个 CPU 的指向空闲对象本地缓存的指针数组(请参阅本章后面的"空闲 Slab 对象的本地缓存"部分)。 Per-CPU array of pointers to local caches of free objects (see the section "Local Caches of Free Slab Objects" later in this chapter). |
| unsigned int | batchcount | 要批量传输到本地缓存或从本地缓存传输的对象数量。 Number of objects to be transferred in bulk to or from the local caches. |
| unsigned int | limit | 本地缓存中空闲对象的最大数量。这是可调的。 Maximum number of free objects in the local caches. This is tunable. |
| struct kmem_list3 | lists | 参见下表。 See next table. |
| unsigned int | objsize | 缓存中包含的对象的大小。 Size of the objects included in the cache. |
| unsigned int | flags | 描述缓存永久属性的标志集。 Set of flags that describes permanent properties of the cache. |
| unsigned int | num | 打包到单个 slab 中的对象数量。(缓存的所有 slab 都具有相同的大小。) Number of objects packed into a single slab. (All slabs of the cache have the same size.) |
| unsigned int | free_limit | 整个 slab 缓存中空闲对象的上限。 Upper limit of free objects in the whole slab cache. |
| spinlock_t | spinlock | 缓存自旋锁。 Cache spin lock. |
| unsigned int | gfporder | 单个 slab 中包含的连续页框数量的对数。 Logarithm of the number of contiguous page frames included in a single slab. |
| unsigned int | gfpflags | 分配页框时传递给伙伴系统函数的标志集。 Set of flags passed to the buddy system function when allocating page frames. |
| size_t | colour | slab 的颜色数量(请参阅本章后面的"Slab 着色"部分)。 Number of colors for the slabs (see the section "Slab Coloring" later in this chapter). |
| unsigned int | colour_off | slab 中的基本对齐偏移。 Basic alignment offset in the slabs. |
| unsigned int | colour_next | 用于下一个分配的 slab 的颜色。 Color to use for the next allocated slab. |
| kmem_cache_t * | slabp_cache | 指向包含 slab 描述符的通用 slab 缓存的指针(如果使用内部 slab 描述符,则为 NULL;请参阅下一节)。 Pointer to the general slab cache containing the slab descriptors (NULL if the internal slab descriptors are used; see next section). |
| unsigned int | slab_size | 单个 slab 的大小。 The size of a single slab. |
| unsigned int | dflags | 描述缓存动态属性的标志集。 Set of flags that describe dynamic properties of the cache. |
| void * | ctor | 指向与缓存关联的构造函数方法的指针。 Pointer to constructor method associated with the cache. |
| void * | dtor | 指向与缓存关联的析构函数方法的指针。 Pointer to destructor method associated with the cache. |
| const char * | name | 存储缓存名称的字符数组。 Character array storing the name of the cache. |
| struct list_head | next | 缓存描述符双向链表的指针。 Pointers for the doubly linked list of cache descriptors. |
kmem_cache_t 描述符的 lists 字段又是一个结构,其字段列于表 8-9 中。
The lists field of the
kmem_cache_t descriptor, in turn,
is a structure whose fields are listed in Table 8-9.
表 8-9。kmem_list3 结构的字段
Table 8-9. The fields of the kmem_list3 structure
类型 Type | 姓名 Name | 描述 Description |
|---|---|---|
| struct list_head | slabs_partial | 同时具有空闲和非空闲对象的 slab 描述符的双向链接循环列表 Doubly linked circular list of slab descriptors with both free and nonfree objects |
| struct list_head | slabs_full | 没有空闲对象的 slab 描述符的双向链接循环列表 Doubly linked circular list of slab descriptors with no free objects |
| struct list_head | slabs_free | 仅具有空闲对象的 slab 描述符的双向链接循环列表 Doubly linked circular list of slab descriptors with free objects only |
| unsigned long | free_objects | 缓存中空闲对象的数量 Number of free objects in the cache |
| int | free_touched | 由 slab 分配器的页面回收算法使用(参见第 17 章) Used by the slab allocator's page reclaiming algorithm (see Chapter 17) |
| unsigned long | next_reap | 由 slab 分配器的页面回收算法使用(参见第 17 章) Used by the slab allocator's page reclaiming algorithm (see Chapter 17) |
| struct array_cache * | shared | 指向所有 CPU 共享的本地缓存的指针(请参阅后面的"空闲 Slab 对象的本地缓存"一节) Pointer to a local cache shared by all CPUs (see the later section "Local Caches of Free Slab Objects") |
缓存的每个 slab 都有自己的 slab 类型的描述符,如表 8-10 所示。
Each slab of a cache has its own descriptor of type
slab illustrated in Table 8-10.
表 8-10。板描述符的字段
Table 8-10. The fields of the slab descriptor
类型 Type | 姓名 Name | 描述 Description |
|---|---|---|
| struct list_head | list | 指向 slab 描述符三个双向链表之一的指针(缓存描述符 kmem_list3 结构中的 slabs_full、slabs_partial 或 slabs_free 列表) Pointers for one of the three doubly linked lists of slab descriptors (either the slabs_full, slabs_partial, or slabs_free list in the kmem_list3 structure of the cache descriptor) |
| unsigned long | colouroff | slab 中第一个对象的偏移(请参阅本章后面的"Slab 着色"部分) Offset of the first object in the slab (see the section "Slab Coloring" later in this chapter) |
| void * | s_mem | slab 中第一个对象(已分配或空闲)的地址 Address of first object (either allocated or free) in the slab |
| unsigned int | inuse | 当前正在使用(非空闲)的 slab 中对象的数量 Number of objects in the slab that are currently used (not free) |
| unsigned int | free | slab 中下一个空闲对象的索引;如果没有剩余空闲对象,则为 BUFCTL_END Index of next free object in the slab, or BUFCTL_END if there are no free objects left |
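The `free` field and the next-free index chain can be illustrated with a toy slab in C. The `bufctl` array plays the role of the object descriptors (described later in the chapter); all sizes and names here are invented for the example.

```c
#include <assert.h>

#define OBJS_PER_SLAB 4
#define BUFCTL_END    (-1)

struct toy_slab {
    int inuse;                    /* objects currently allocated */
    int free;                     /* index of the next free object */
    int bufctl[OBJS_PER_SLAB];    /* next-free index chain */
};

static void toy_slab_init(struct toy_slab *s)
{
    s->inuse = 0;
    s->free = 0;
    /* Chain every object to its successor; the last one ends the chain. */
    for (int i = 0; i < OBJS_PER_SLAB - 1; i++)
        s->bufctl[i] = i + 1;
    s->bufctl[OBJS_PER_SLAB - 1] = BUFCTL_END;
}

/* Returns the index of the allocated object, or -1 if the slab is full. */
static int toy_slab_alloc(struct toy_slab *s)
{
    if (s->free == BUFCTL_END)
        return -1;
    int obj = s->free;
    s->free = s->bufctl[obj];     /* advance to the next free object */
    s->inuse++;
    return obj;
}
```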
板描述符可以存储在两个可能的位置:
Slab descriptors can be stored in two possible places:
存储在 slab 外部,位于 cache_sizes 指向的、不适合 ISA DMA 的通用缓存之一中(请参阅下一节)。
Stored outside the slab, in one of the general
caches not suitable for ISA DMA pointed to by cache_sizes (see the next
section).
存储在slab内部,位于分配给该slab的第一个页框的开头。
Stored inside the slab, at the beginning of the first page frame assigned to the slab.
当对象的大小小于 512 字节,或者内部碎片在 slab 内部为 slab 描述符和对象描述符(如下所述)留下了足够的空间时,slab 分配器选择第二种解决方案。如果 slab 描述符存储在 slab 之外,则高速缓存描述符 flags 字段中的 CFLGS_OFF_SLAB 标志设置为 1;否则它被设置为零。
The slab allocator chooses the second solution when the size of
the objects is smaller than 512 bytes or when internal fragmentation leaves enough space for the slab
descriptor and the object descriptors (as described later) inside the
slab. The CFLGS_OFF_SLAB flag in
the flags field of the cache
descriptor is set to one if the slab descriptor is stored outside the
slab; it is set to zero otherwise.
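The size cutoff can be sketched as a one-liner. The PAGE_SIZE/8 test (512 bytes with 4-KB pages) follows the 2.6 sources; the `toy_` names and the simplification of the "enough space" condition are inventions of this example.

```c
#include <assert.h>

#define TOY_PAGE_SIZE  4096
#define CFLGS_OFF_SLAB 0x80000000u

/* Decide where the slab descriptor lives for a given object size. */
static unsigned int toy_mgmt_flags(unsigned long obj_size)
{
    if (obj_size >= (TOY_PAGE_SIZE >> 3))
        return CFLGS_OFF_SLAB;   /* descriptor kept in a general cache */
    return 0;                    /* descriptor kept inside the slab */
}
```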
图8-4 说明了缓存和slab描述符之间的主要关系。完整板、部分完整板和空闲板链接在不同的列表中。
Figure 8-4 illustrates the major relationships between cache and slab descriptors. Full slabs, partially full slabs, and free slabs are linked in different lists.
缓存分为两种类型:通用缓存和专用缓存。 通用缓存仅由slab分配器用于其自身目的,而特定缓存 被内核的其余部分使用。
Caches are divided into two types: general and specific. General caches are used only by the slab allocator for its own purposes, while specific caches are used by the remaining parts of the kernel.
一般的缓存有:
The general caches are:
第一个缓存称为kmem_cache,其对象是内核使用的其余缓存的缓存描述符。该cache_cache
变量包含该特殊缓存的描述符。
A first cache called kmem_cache whose
objects are the cache descriptors of the remaining caches used by
the kernel. The cache_cache
variable contains the descriptor of this special cache.
几个附加高速缓存包含通用内存区域。存储区域大小的范围通常包括 13 个几何分布的大小。一个名为 malloc_sizes 的表(其元素类型为 cache_sizes)指向与大小为 32、64、128、256、512、1,024、2,048、4,096、8,192、16,384、32,768、65,536 和 131,072 字节的内存区域关联的 26 个缓存描述符。对于每种大小,都有两个缓存:一个适合 ISA DMA 分配,另一个适合普通分配。
Several additional caches contain general purpose memory
areas. The range of the memory area sizes typically includes 13
geometrically distributed sizes. A table called malloc_sizes (whose elements are of type
cache_sizes) points to 26 cache
descriptors associated with memory areas of size 32, 64, 128, 256,
512, 1,024, 2,048, 4,096, 8,192, 16,384, 32,768, 65,536, and
131,072 bytes. For each size, there are two caches: one suitable
for ISA DMA allocations and the other for normal
allocations.
该kmem_cache_init( )
函数在系统初始化期间调用以设置通用缓存。
The kmem_cache_init( )
function is invoked during system initialization to set up the general
caches.
特定缓存由 kmem_cache_create( ) 函数创建。根据参数,该函数首先确定处理新缓存的最佳方式(例如,是将 slab 描述符包含在 slab 内部还是外部)。然后,它从 cache_cache 通用缓存中为新缓存分配一个缓存描述符,并将该描述符插入 cache_chain 缓存描述符列表中(插入是在获取保护该列表免受并发访问的 cache_chain_sem 信号量之后完成的)。
Specific caches are created by the kmem_cache_create( ) function. Depending on
the parameters, the function first determines the best way to handle
the new cache (for instance, whether to include the slab descriptor
inside or outside of the slab). It then allocates a cache descriptor
for the new cache from the cache_cache general cache and inserts the
descriptor in the cache_chain list
of cache descriptors (the insertion is done after having acquired the
cache_chain_sem semaphore that
protects the list from concurrent accesses).
还可以通过调用 kmem_cache_destroy( ) 来销毁缓存并将其从 cache_chain 列表中删除。此函数对于在加载时创建自己的缓存并在卸载时销毁它们的模块最有用。为了避免浪费内存空间,内核必须在销毁缓存本身之前销毁其所有 slab。kmem_cache_shrink( ) 函数通过迭代调用 slab_destroy( ) 来销毁缓存中的所有 slab(请参阅后面的"从缓存中释放 Slab"一节)。
It is also possible to destroy a cache and remove it from the
cache_chain list by invoking
kmem_cache_destroy( ). This
function is mostly useful to modules that create their own caches when
loaded and destroy them when unloaded. To avoid wasting memory space,
the kernel must destroy all slabs before destroying the cache itself.
The kmem_cache_shrink( ) function
destroys all the slabs in a cache by invoking slab_destroy( ) iteratively (see the later
section "Releasing a Slab
from a Cache").
所有通用和特定缓存的名称可以在运行时通过读取/proc/slabinfo获得;该文件还指定每个缓存中空闲对象的数量和已分配对象的数量。
The names of all general and specific caches can be obtained at runtime by reading /proc/slabinfo; this file also specifies the number of free objects and the number of allocated objects in each cache.
当slab分配器创建一个新的slab时,它依赖分区页框分配器来获取一组空闲的连续页框。为此,它调用该kmem_getpages( )函数,该函数在 UMA 系统上本质上等效于以下代码片段:
When the slab allocator creates a new slab, it relies on
the zoned page frame allocator to obtain a group of free contiguous
page frames. For this purpose, it invokes the kmem_getpages( ) function, which is
essentially equivalent, on a UMA system, to the following code
fragment:
void * kmem_getpages(kmem_cache_t *cachep, int flags)
{
struct page *page;
int i;
flags |= cachep->gfpflags;
page = alloc_pages(flags, cachep->gfporder);
if (!page)
return NULL;
i = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
atomic_add(i, &slab_reclaim_pages);
while (--i >= 0)
SetPageSlab(page+i);
return page_address(page);
}
The two parameters have the following meaning:
cachep
Points to the cache descriptor of the cache that needs
additional page frames (the number of required page frames is
determined by the order in the cachep->gfporder field).
flags
Specifies how the page frame is requested (see the section
"The Zoned Page
Frame Allocator" earlier in this chapter). This set of
flags is combined with the specific cache allocation flags
stored in the gfpflags field
of the cache descriptor.
The size of the memory allocation request is specified by the
gfporder field of the cache
descriptor, which encodes the size of a slab in the cache.[*] If the slab cache has been created with the SLAB_RECLAIM_ACCOUNT flag set, the page
frames assigned to the slabs are accounted for as reclaimable pages
when the kernel checks whether there is enough memory to satisfy some
User Mode requests. The function also sets the PG_slab flag in the page descriptors of the
allocated page frames.
In the reverse operation, page frames assigned to a slab can be
released (see the section "Releasing a Slab from a
Cache" later in this chapter) by invoking the kmem_freepages( ) function:
void kmem_freepages(kmem_cache_t *cachep, void *addr)
{
unsigned long i = (1<<cachep->gfporder);
struct page *page = virt_to_page(addr);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += i;
while (i--)
ClearPageSlab(page++);
free_pages((unsigned long) addr, cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
atomic_sub(1<<cachep->gfporder, &slab_reclaim_pages);
}
The function releases the page frames, starting from the one
having the linear address addr,
that had been allocated to the slab of the cache identified by
cachep. If the current process is
performing memory reclaiming (current->reclaim_state field not NULL), the reclaimed_slab field of the reclaim_state structure is properly
increased, so that the pages just freed can be accounted for by the
page frame reclaiming algorithm (see the section "Low On Memory
Reclaiming" in Chapter
17). Moreover, if the SLAB_RECLAIM_ACCOUNT flag is set (see
above), the slab_reclaim_pages
variable is properly decreased.
A newly created cache does not contain a slab and therefore does not contain any free objects. New slabs are assigned to a cache only when both of the following are true:
A request has been issued to allocate a new object.
The cache does not include a free object.
Under these circumstances, the slab allocator assigns a new slab
to the cache by invoking cache_grow(
). This function calls kmem_
getpages( ) to obtain from the zoned page frame allocator
the group of page frames needed to store a single slab; it then calls
alloc_slabmgmt( ) to get a new slab
descriptor. If the CFLGS_OFF_SLAB
flag of the cache descriptor is set, the slab descriptor is allocated
from the general cache pointed to by the slabp_cache field of the cache descriptor;
otherwise, the slab descriptor is allocated in the first page frame of
the slab.
The kernel must be able to determine, given a page frame,
whether it is used by the slab allocator and, if so, to derive quickly
the addresses of the corresponding cache and slab descriptors.
Therefore, cache_ grow( ) scans all
page descriptors of the page frames assigned to the new slab, and
loads the next and prev subfields of the lru fields in the page descriptors with the
addresses of, respectively, the cache descriptor and the slab
descriptor. This works correctly because the lru field is used by functions of the buddy
system only when the page frame is free, while page frames handled by
the slab allocator functions have the PG_slab flag set and are not free as far as
the buddy system is concerned.[*] The opposite question—given a slab in a cache, which are
the page frames that implement it?—can be answered by using the
s_mem field of the slab descriptor
and the gfporder field (the size of
a slab) of the cache descriptor.
Next, cache_grow( ) calls
cache_init_objs( ), which applies
the constructor method (if defined) to all the objects contained in
the new slab.
Finally, cache_ grow( ) calls
list_add_tail( ) to add the newly
obtained slab descriptor *slabp at
the end of the fully free slab list of the cache descriptor *cachep, and updates the counter of free
objects in the cache:
list_add_tail(&slabp->list, &cachep->lists->slabs_free); cachep->lists->free_objects += cachep->num;
Slabs can be destroyed in two cases:
There are too many free objects in the slab cache (see the later section "Releasing a Slab from a Cache").
A timer function invoked periodically determines that there are fully unused slabs that can be released (see Chapter 17).
In both cases, the slab_destroy(
) function is invoked to destroy a slab and release the
corresponding page frames to the zoned page frame allocator:
void slab_destroy(kmem_cache_t *cachep, slab_t *slabp)
{
if (cachep->dtor) {
int i;
for (i = 0; i < cachep->num; i++) {
void* objp = slabp->s_mem+cachep->objsize*i;
(cachep->dtor)(objp, cachep, 0);
}
}
kmem_freepages(cachep, slabp->s_mem - slabp->colouroff);
if (cachep->flags & CFLGS_OFF_SLAB)
kmem_cache_free(cachep->slabp_cache, slabp);
}
The function checks whether the cache has a destructor method
for its objects (the dtor field is
not NULL), in which case it applies
the destructor to all the objects in the slab; the objp local variable keeps track of the
currently examined object. Next, it calls kmem_freepages( ), which returns all the
contiguous page frames used by the slab to the buddy system. Finally,
if the slab descriptor is stored outside of the slab, the function
releases it from the cache of slab descriptors.
Actually, the function is slightly more complicated. For
example, a slab cache can be created with the SLAB_DESTROY_BY_RCU flag, which means that
slabs should be released in a deferred way by registering a callback
with the call_rcu( ) function (see
the section "Read-Copy
Update (RCU)" in Chapter
5). The callback function, in turn, invokes kmem_freepages( ) and, possibly, kmem_cache_free( ), as in the main case shown
above.
Each object has a short descriptor of type kmem_bufctl_t. Object descriptors are stored
in an array placed right after the corresponding slab descriptor.
Thus, like the slab descriptors themselves, the object descriptors of
a slab can be stored in two possible ways that are illustrated in
Figure 8-5.
Stored outside the slab, in the general cache pointed to
by the slabp_cache field of
the cache descriptor. The size of the memory area, and thus the
particular general cache used to store object descriptors,
depends on the number of objects stored in the slab (num field of the cache
descriptor).
Stored inside the slab, right before the objects they describe.
The first object descriptor in the array describes the first
object in the slab, and so on. An object descriptor is simply an
unsigned short integer, which is meaningful only when the object is
free. It contains the index of the next free object in the slab, thus
implementing a simple list of free objects inside the slab. The object
descriptor of the last element in the free object list is marked by
the conventional value BUFCTL_END
(0xffff).
The objects managed by the slab allocator are aligned in memory—that is, they are stored in memory cells whose initial physical addresses are multiples of a given constant, which is usually a power of 2. This constant is called the alignment factor.
The largest alignment factor allowed by the slab allocator is 4,096—the page frame size. This means that objects can be aligned by referring to either their physical addresses or their linear addresses. In both cases, only the 12 least significant bits of the address may be altered by the alignment.
Usually, microcomputers access memory cells more quickly if
their physical addresses are aligned with respect to the word size
(that is, to the width of the internal memory bus of the computer).
Thus, by default, the kmem_cache_create(
) function aligns objects according to the word size
specified by the BYTES_PER_WORD
macro. For 80 × 86 processors, the macro yields the value 4 because
the word is 32 bits long.
When creating a new slab cache, it's possible to specify that
the objects included in it be aligned in the first-level hardware
cache. To achieve this, the kernel sets the SLAB_HWCACHE_ALIGN cache descriptor flag.
The kmem_cache_create( ) function
handles the request as follows:
If the object's size is greater than half of a cache line,
it is aligned in RAM to a multiple of L1_CACHE_BYTES—that is, at the beginning
of the line.
Otherwise, the object size is rounded up to a submultiple of
L1_CACHE_BYTES; this ensures
that a small object will never span across two cache lines.
Clearly, what the slab allocator is doing here is trading memory space for access time; it gets better cache performance by artificially increasing the object size, thus causing additional internal fragmentation.
We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter, we have also seen that objects of the same size end up being stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations, while other cache lines go underutilized. The slab allocator tries to reduce this unpleasant cache behavior by a policy called slab coloring : different arbitrary values called colors are assigned to the slabs.
Before examining slab coloring, we have to look at the layout of objects in the cache. Let's consider a cache whose objects are aligned in RAM. This means that the object address must be a multiple of a given positive value, say aln. Even taking the alignment constraint into account, there are many possible ways to place objects inside the slab. The choices depend on decisions made for the following variables:
num
Number of objects that can be stored in a slab (its value is in the num field of the cache descriptor).
osize
Object size, including the alignment bytes.
dsize
Slab descriptor size plus all object descriptors size, rounded up to the smallest multiple of the hardware cache line size. Its value is equal to 0 if the slab and object descriptors are stored outside of the slab.
free
Number of unused bytes (bytes not assigned to any object) inside the slab.
The total length in bytes of a slab can then be expressed as:
slab length = (num × osize) + dsize + free
free is always smaller than osize, because otherwise, it would be possible to place additional objects inside the slab. However, free could be greater than aln.
The slab allocator takes advantage of the free unused bytes to color the slab. The term "color" is used simply to subdivide the slabs and allow the memory allocator to spread objects out among different linear addresses. In this way, the kernel obtains the best possible performance from the microprocessor's hardware cache.
Slabs having different colors store the first object of the slab
in different memory locations, while satisfying the alignment
constraint. The number of available colors is
free/aln (this value is stored in the colour field of the cache descriptor). Thus,
the first color is denoted as 0 and the last one is denoted as
(free / aln)−1. (As a particular case, if
free is lower than aln,
colour is set to 0; nevertheless,
all slabs use color 0, so the number of colors is effectively
one.)
If a slab is colored with color col, the offset of the first object (with respect to the slab initial address) is equal to col × aln + dsize bytes. Figure 8-6 illustrates how the placement of objects inside the slab depends on the slab color. Coloring essentially leads to moving some of the free area of the slab from the end to the beginning.
Coloring works only when free is large enough. Clearly, if no alignment is required for the objects or if the number of unused bytes inside the slab is smaller than the required alignment (free < aln), the only possible slab coloring is the one that has the color 0—the one that assigns a zero offset to the first object.
The various colors are distributed equally among slabs of a
given object type by storing the current color in a field of the cache
descriptor called colour_next. The
cache_ grow( ) function assigns the
color specified by colour_next to a
new slab and then increases the value of this field. After reaching
colour, it wraps around again to 0.
In this way, each slab is created with a different color from the
previous one, up to the maximum available colors. The cache_grow( ) function, moreover, gets the
value aln from the colour_off field of the cache descriptor,
computes dsize according to the number of objects
inside the slab, and finally stores the value
col × aln + dsize in the
colouroff field of the slab
descriptor.
The Linux 2.6 implementation of the slab allocator for multiprocessor systems differs from that of the original Solaris 2.4. To reduce spin lock contention among processors and to make better use of the hardware caches, each cache of the slab allocator includes a per-CPU data structure consisting of a small array of pointers to freed objects called the slab local cache . Most allocations and releases of slab objects affect the local cache only; the slab data structures get involved only when the local cache underflows or overflows. This technique is quite similar to the one illustrated in the section "The Per-CPU Page Frame Cache" earlier in this chapter.
The array field of the cache
descriptor is an array of pointers to array_cache data structures, one element for
each CPU in the system. Each array_cache data structure is a descriptor
of the local cache of free objects, whose fields are illustrated in
Table 8-11.
Table 8-11. The fields of the array_cache structure
| Type | Name | Description |
|---|---|---|
| unsigned int | avail | Number of pointers to available objects in the local cache. The field also acts as the index of the first free slot in the cache. |
| unsigned int | limit | Size of the local cache—that is, the maximum number of pointers in the local cache. |
| unsigned int | batchcount | Chunk size for local cache refill or emptying. |
| unsigned int | touched | Flag set to 1 if the local cache has been recently used. |
Notice that the local cache descriptor does not include the address of the local cache itself; in fact, the local cache is placed right after the descriptor. Of course, the local cache stores the pointers to the freed objects, not the objects themselves, which are always placed inside the slabs of the cache.
When creating a new slab cache, the kmem_cache_create( ) function determines the
size of the local caches (storing this value in the limit field of the cache descriptor),
allocates them, and stores their pointers into the array field of the cache descriptor. The
size depends on the size of the objects stored in the slab cache, and
ranges from 1 for very large objects to 120 for small ones. Moreover,
the initial value of the batchcount
field, which is the number of objects added or removed in a chunk from
a local cache, is initially set to half of the local cache
size.[*]
In multiprocessor systems, slab caches for small objects also
include an additional local cache, whose address is stored in the
lists.shared field of the cache
descriptor. The shared local cache is, as the name suggests, shared among all CPUs, and it
makes the task of migrating free objects from a local cache to another
easier (see the following section). Its initial size is equal to eight
times the value of the batchcount
field.
New objects may be obtained by invoking the kmem_cache_alloc( ) function. The parameter
cachep points to the cache
descriptor from which the new free object must be obtained, while the
parameter flags represents the flags
to be passed to the zoned page frame allocator functions, should all
slabs of the cache be full.
The function is essentially equivalent to the following:
void * kmem_cache_alloc(kmem_cache_t *cachep, int flags)
{
unsigned long save_flags;
void *objp;
struct array_cache *ac;
local_irq_save(save_flags);
ac = cachep->array[smp_processor_id()];
if (ac->avail) {
ac->touched = 1;
objp = ((void **)(ac+1))[--ac->avail];
} else
objp = cache_alloc_refill(cachep, flags);
local_irq_restore(save_flags);
return objp;
}
The function tries first to retrieve a free object from the
local cache. If there are free objects, the avail field contains the index in the local
cache of the entry that points to the last freed object. Because the
local cache array is stored right after the ac descriptor, ((void**)(ac+1))[--ac->avail] gets the
address of that free object and decreases the value of ac->avail. The cache_alloc_refill( ) function is invoked to
repopulate the local cache and get a free object when there are no
free objects in the local cache.
The cache_alloc_refill( )
function essentially performs the following steps:
Stores in the ac local
variable the address of the local cache descriptor:
ac = cachep->array[smp_processor_id()];
Gets the cachep->spinlock.
If the slab cache includes a shared local cache, and if the
shared local cache includes some free objects, it refills the
CPU's local cache by moving up to ac->batchcount pointers from the
shared local cache. Then, it jumps to step 6.
Tries to fill the local cache with up to ac->batchcount pointers to free
objects included in the slabs of the cache:
Looks in the slabs_partial and slabs_free lists of the cache
descriptor, and gets the address slabp of a slab descriptor whose
corresponding slab is either partially filled or empty. If no
such descriptor exists, the function goes to step 5.
For each free object in the slab, the function increases
the inuse field of the slab
descriptor, inserts the object's address in the local cache,
and updates the free field
so that it stores the index of the next free object in the
slab:
slabp->inuse++;
((void**)(ac+1))[ac->avail++] =
slabp->s_mem + slabp->free * cachep->objsize;
slabp->free = ((kmem_bufctl_t*)(slabp+1))[slabp->free];
Inserts, if necessary, the depleted slab in the proper
list, either the slab_full
or the slab_partial
list.
At this point, the number of pointers added to the local
cache is stored in the ac->avail field: the function
decreases the free_objects
field of the kmem_list3
structure by the same amount to specify that the objects are no
longer free.
Releases the cachep->spinlock.
If the ac->avail field
is now greater than 0 (some cache refilling took place), it sets
the ac->touched field to 1
and returns the free object pointer that was last inserted in the
local cache:
return ((void**)(ac+1))[--ac->avail];
Otherwise, no cache refilling took place: invokes cache_grow() to get a new slab, and thus
new free objects.
If cache_grow() fails, it
returns NULL; otherwise it goes
back to step 1 to repeat the procedure.
The kmem_cache_free(
) function releases an object previously allocated by the
slab allocator to some kernel function. Its parameters are cachep, the address of the cache descriptor,
and objp, the address of the object
to be released:
void kmem_cache_free(kmem_cache_t *cachep, void *objp)
{
unsigned long flags;
struct array_cache *ac;
local_irq_save(flags);
ac = cachep->array[smp_processor_id()];
if (ac->avail == ac->limit)
cache_flusharray(cachep, ac);
((void**)(ac+1))[ac->avail++] = objp;
local_irq_restore(flags);
}
The function checks first whether the local cache has room for
an additional pointer to a free object. If so, the pointer is added to
the local cache and the function returns. Otherwise it first invokes
cache_flusharray( ) to deplete the
local cache and then adds the pointer to the local cache.
The cache_flusharray( )
function performs the following operations:
Acquires the cachep->spinlock spin lock.
If the slab cache includes a shared local cache, and if the
shared local cache is not already full, it refills the shared
local cache by moving up to ac->batchcount pointers from the
CPU's local cache. Then, it jumps to step 4.
Invokes the free_block( )
function to give back to the slab allocator up to ac->batchcount objects currently
included in the local cache. For each object at address objp, the function executes the
following steps:
Increases the lists.free_objects field of the
cache descriptor.
Determines the address of the slab descriptor containing the object:
slabp = (struct slab *)(virt_to_page(objp)->lru.prev);
(Remember that the lru.prev field of the descriptor of
the slab page points to the corresponding slab
descriptor.)
Removes the slab descriptor from its slab cache list
(either cachep->lists.slabs_partial or
cachep->lists.slabs_full).
Computes the index of the object inside the slab:
objnr = (objp - slabp->s_mem) / cachep->objsize;
Stores in the object descriptor the current value of the
slabp->free, and puts in
slabp->free the index of
the object (the last released object will be the first object
to be allocated again):
((kmem_bufctl_t *)(slabp+1))[objnr] = slabp->free; slabp->free = objnr;
Decreases the slabp->inuse field.
If slabp->inuse is
equal to zero—all objects in the slab are free—and the number
of free objects in the whole slab cache (cachep->lists.free_objects) is
greater than the limit stored in the cachep->free_limit field, then
the function releases the slab's page frame(s) to the zoned
page frame allocator:
cachep->lists.free_objects -= cachep->num; slab_destroy(cachep, slabp);
The value stored in the cachep->free_limit field is
usually equal to cachep->num+
(1+N) × cachep->batchcount, where
N denotes the number of CPUs of the
system.
Otherwise, if slab->inuse is equal to zero but
the number of free objects in the whole slab cache is less
than cachep->free_limit,
it inserts the slab descriptor in the cachep->lists.slabs_free
list.
Finally, if slab->inuse is greater than zero,
the slab is partially filled, so the function inserts the slab
descriptor in the cachep->lists.slabs_partial
list.
Releases the cachep->spinlock spin lock.
Updates the avail field
of the local cache descriptor by subtracting the number of objects
moved to the shared local cache or released to the slab
allocator.
Moves all valid pointers in the local cache to the beginning of the local cache's array. This step is necessary because the first object pointers have been removed from the local cache, so the remaining ones must be moved up.
As stated earlier in the section "The Buddy System Algorithm," infrequent requests for memory areas are handled through a group of general caches whose objects have geometrically distributed sizes ranging from a minimum of 32 to a maximum of 131,072 bytes.
Objects of this type are obtained by invoking the kmalloc( ) function, which is essentially
equivalent to the following code fragment:
void * kmalloc(size_t size, int flags)
{
struct cache_sizes *csizep = malloc_sizes;
kmem_cache_t * cachep;
for (; csizep->cs_size; csizep++) {
if (size > csizep->cs_size)
continue;
if (flags & __GFP_DMA)
cachep = csizep->cs_dmacachep;
else
cachep = csizep->cs_cachep;
return kmem_cache_alloc(cachep, flags);
}
return NULL;
}
The function uses the malloc_sizes table to locate the nearest
power-of-2 size to the requested size. It then calls kmem_cache_alloc( ) to allocate the object,
passing to it either the cache descriptor for the page frames usable
for ISA DMA or the cache descriptor for the "normal" page frames,
depending on whether the caller specified the __GFP_DMA flag.
Objects obtained by invoking kmalloc( ) can be released by calling kfree( ):
void kfree(const void *objp)
{
kmem_cache_t * c;
unsigned long flags;
if (!objp)
return;
local_irq_save(flags);
c = (kmem_cache_t *)(virt_to_page(objp)->lru.next);
kmem_cache_free(c, (void *)objp);
local_irq_restore(flags);
}
The proper cache descriptor is identified by reading the
lru.next subfield of the descriptor
of the first page frame containing the memory area. The memory area is
released by invoking kmem_cache_free( ).
Memory pools are a new feature of Linux 2.6. Basically, a memory pool allows a kernel component—such as the block device subsystem—to allocate some dynamic memory to be used only in low-on-memory emergencies.
Memory pools should not be confused with the reserved page frames described in the earlier section "The Pool of Reserved Page Frames." In fact, those page frames can be used only to satisfy atomic memory allocation requests issued by interrupt handlers or inside critical regions. Instead, a memory pool is a reserve of dynamic memory that can be used only by a specific kernel component, namely the "owner" of the pool. The owner does not normally use the reserve; however, if dynamic memory becomes so scarce that all usual memory allocation requests are doomed to fail, the kernel component can invoke, as a last resort, special memory pool functions that dip in the reserve and get the memory needed. Thus, creating a memory pool is similar to keeping a reserve of canned foods on hand and using a can opener only when no fresh food is available.
Often, a memory pool is stacked over the slab allocator—that is,
it is used to keep a reserve of slab objects. Generally speaking,
however, a memory pool can be used to allocate every kind of dynamic
memory, from whole page frames to small memory areas allocated with
kmalloc(). Therefore, we will
generically refer to the memory units handled by a memory pool as
"memory elements."
A memory pool is described by a mempool_t object, whose fields are shown in
Table 8-12.
Table 8-12. The fields of the mempool_t object
| Type | Name | Description |
|---|---|---|
| spinlock_t | lock | Spin lock protecting the object fields |
| int | min_nr | Maximum number of elements in the memory pool |
| int | curr_nr | Current number of elements in the memory pool |
| void ** | elements | Pointer to an array of pointers to the reserved elements |
| void * | pool_data | Private data available to the pool's owner |
| mempool_alloc_t * | alloc | Method to allocate an element |
| mempool_free_t * | free | Method to free an element |
| wait_queue_head_t | wait | Wait queue used when the memory pool is empty |
The min_nr field stores the
initial number of elements in the memory pool. In other words, the
value stored in this field represents the number of memory elements
that the owner of the memory pool is sure to obtain from the memory
allocator. The curr_nr field, which
is always lower than or equal to min_nr, stores the number of memory elements
currently included in the memory pool. The memory elements themselves
are referenced by an array of pointers, whose address is stored in the
elements field.
The alloc and free methods interface with the underlying
memory allocator to get and release a memory element, respectively.
Both methods may be custom functions provided by the kernel component
that owns the memory pool.
When the memory elements are slab objects, the alloc and free methods are commonly implemented by the
mempool_alloc_slab( ) and mempool_free_slab( ) functions, which just
invoke the kmem_cache_alloc( ) and
kmem_cache_free( ) functions,
respectively. In this case, the pool_data field of the mempool_t object stores the address of the
slab cache descriptor.
The mempool_create( )
function creates a new memory pool; it receives the number of memory
elements min_nr, the addresses of
the functions that implement the alloc and free methods, and an optional value for the
pool_data field. The function
allocates memory for the mempool_t
object and the array of pointers to the memory elements, then
repeatedly invokes the alloc method
to get the min_nr memory elements.
Conversely, the mempool_destroy( )
function releases all memory elements in the pool, then releases the
array of elements and the mempool_t
object themselves.
To allocate an element from a memory pool, the kernel invokes
the mempool_alloc( ) function,
passing to it the address of the mempool_t object and the memory allocation
flags (see Table 8-5
and Table 8-6
earlier in this chapter). Essentially, the function tries to allocate
a memory element from the underlying memory allocator by invoking the
alloc method, according to the
memory allocation flags specified as parameters. If the allocation
succeeds, the function returns the memory element obtained, without
touching the memory pool. Otherwise, if the allocation fails, the
memory element is taken from the memory pool. Of course, too many
allocations in a low-on-memory condition can exhaust the memory pool:
in this case, if the __GFP_WAIT
flag is not set, mempool_alloc()
blocks the current process until a memory element is released to the
memory pool.
Conversely, to release an element to a memory pool, the kernel
invokes the mempool_free( )
function. If the memory pool is not full (curr_nr is smaller than min_nr), the function adds the element to
the memory pool. Otherwise, mempool_free(
) invokes the free method
to release the element to the underlying memory allocator.
[*] Notice that it is not possible to allocate page frames from
the ZONE_HIGHMEM memory zone,
because the kmem_getpages( )
function returns the linear address yielded by the page_address( ) function; as explained
in the section "Kernel
Mappings of High-Memory Page Frames" earlier in this
chapter, this function returns NULL for unmapped high-memory page
frames.
[*] As we'll see in Chapter
17, the lru field is
also used by the page frame reclaiming algorithm.
We already know that it is preferable to map memory areas into sets of contiguous page frames, thus making better use of the cache and achieving lower average memory access times. Nevertheless, if the requests for memory areas are infrequent, it makes sense to consider an allocation scheme based on noncontiguous page frames accessed through contiguous linear addresses . The main advantage of this schema is to avoid external fragmentation, while the disadvantage is that it is necessary to fiddle with the kernel Page Tables. Clearly, the size of a noncontiguous memory area must be a multiple of 4,096. Linux uses noncontiguous memory areas in several ways — for instance, to allocate data structures for active swap areas (see the section "Activating and Deactivating a Swap Area" in Chapter 17), to allocate space for a module (see Appendix B), or to allocate buffers to some I/O drivers. Furthermore, noncontiguous memory areas provide yet another way to make use of high memory page frames (see the later section "Allocating a Noncontiguous Memory Area").
To find a free range of linear addresses, we can look in
the area starting from PAGE_OFFSET
(usually 0xc0000000, the beginning
of the fourth gigabyte). Figure 8-7 shows how the
fourth gigabyte linear addresses are used:
The beginning of the area includes the linear addresses that
map the first 896 MB of RAM (see the section "Process Page Tables"
in Chapter 2); the
linear address that corresponds to the end of the directly mapped
physical memory is stored in the high_memory variable.
The end of the area contains the fix-mapped linear addresses (see the section "Fix-Mapped Linear Addresses" in Chapter 2).
Starting from PKMAP_BASE
we find the linear addresses used for the persistent kernel
mapping of high-memory page frames (see the section "Kernel Mappings of
High-Memory Page Frames" earlier in this chapter).
The remaining linear addresses can be used for noncontiguous
memory areas. A safety interval of size 8 MB (macro VMALLOC_OFFSET) is inserted between the
end of the physical memory mapping and the first memory area; its
purpose is to "capture" out-of-bounds memory accesses. For the
same reason, additional safety intervals of size 4 KB are inserted
to separate noncontiguous memory areas.
The VMALLOC_START macro
defines the starting address of the linear space reserved for
noncontiguous memory areas, while VMALLOC_END defines its ending
address.
Each noncontiguous memory area is associated with a
descriptor of type vm_struct, whose
fields are listed in Table
8-13.
Table 8-13. The fields of the vm_struct descriptor
| Type | Name | Description |
|---|---|---|
| void * | addr | Linear address of the first memory cell of the area |
| unsigned long | size | Size of the area plus 4,096 (inter-area safety interval) |
| unsigned long | flags | Type of memory mapped by the noncontiguous memory area |
| struct page ** | pages | Pointer to an array of pointers to page descriptors |
| unsigned int | nr_pages | Number of pages filled by the area |
| unsigned long | phys_addr | Set to 0 unless the area has been created to map the I/O shared memory of a hardware device |
| struct vm_struct * | next | Pointer to the next vm_struct descriptor |
These descriptors are inserted in a simple list by means of the
next field; the address of the
first element of the list is stored in the vmlist variable. Accesses to this list are
protected by means of the vmlist_lock read/write spin lock. The
flags field identifies the type of
memory mapped by the area: VM_ALLOC
for pages obtained by means of vmalloc(
), VM_MAP for already
allocated pages mapped by means of vmap() (see the next section), and VM_IOREMAP for on-board memory of hardware
devices mapped by means of ioremap(
) (see Chapter
13).
The get_vm_area( ) function
looks for a free range of linear addresses between VMALLOC_START and VMALLOC_END. This function acts on two
parameters: the size (size) in
bytes of the memory region to be created, and a flag (flag) specifying the type of region (see
above). The steps performed are the following:
Invokes kmalloc( ) to
obtain a memory area for the new descriptor of type vm_struct.
Gets the vmlist_lock lock
for writing and scans the list of descriptors of type vm_struct looking for a free range of
linear addresses that includes at least size + 4096 addresses (4096 is the size
of the safety interval between the memory areas).
If such an interval exists, the function initializes the
fields of the descriptor, releases the vmlist_lock lock, and terminates by
returning the initial address of the noncontiguous memory
area.
Otherwise, get_vm_area( )
releases the descriptor obtained previously, releases the vmlist_lock lock, and returns NULL.
The vmalloc( )
function allocates a noncontiguous memory area to the kernel. The
parameter size denotes the size of
the requested area. If the function is able to satisfy the request, it
then returns the initial linear address of the new area; otherwise, it
returns a NULL pointer:
void * vmalloc(unsigned long size)
{
struct vm_struct *area;
struct page **pages;
unsigned int array_size, i;
size = (size + PAGE_SIZE - 1) & PAGE_MASK;
area = get_vm_area(size, VM_ALLOC);
if (!area)
return NULL;
area->nr_pages = size >> PAGE_SHIFT;
array_size = (area->nr_pages * sizeof(struct page *));
area->pages = pages = kmalloc(array_size, GFP_KERNEL);
if (!area->pages) {
remove_vm_area(area->addr);
kfree(area);
return NULL;
}
memset(area->pages, 0, array_size);
for (i=0; i<area->nr_pages; i++) {
area->pages[i] = alloc_page(GFP_KERNEL|__GFP_HIGHMEM);
if (!area->pages[i]) {
area->nr_pages = i;
fail: vfree(area->addr);
return NULL;
}
}
if (map_vm_area(area, __pgprot(0x63), &pages))
goto fail;
return area->addr;
}
The function starts by rounding up the value of the size parameter to a multiple of 4,096 (the
page frame size). Then vmalloc( )
invokes get_vm_area( ), which
creates a new descriptor and returns the linear addresses assigned to
the memory area. The flags field of
the descriptor is initialized with the VM_ALLOC flag, which means that the
noncontiguous page frames will be mapped into a linear address range
by means of the vmalloc( )
function. Then the vmalloc( )
function invokes kmalloc( ) to
request a group of contiguous page frames large enough to contain an
array of page descriptor pointers. The memset( ) function is invoked to set all
these pointers to NULL. Next the
alloc_page( ) function is called
repeatedly, once for each of the nr_pages of the region, to allocate a page
frame and store the address of the corresponding page descriptor in
the area->pages array. Observe
that using the area->pages array
is necessary because the page frames could belong to the ZONE_HIGHMEM memory zone, thus right now
they are not necessarily mapped to a linear address.
Now comes the tricky part. Up to this point, a fresh interval of
contiguous linear addresses has been obtained and a group of
noncontiguous page frames has been allocated to map these linear
addresses. The last crucial step consists of fiddling with the page
table entries used by the kernel to indicate that each page frame
allocated to the noncontiguous memory area is now associated with a
linear address included in the interval of contiguous linear addresses
yielded by vmalloc( ). This is what
map_vm_area( ) does.
The map_vm_area( ) function
uses three parameters:
area
The pointer to the vm_struct descriptor of the
area.
prot
The protection bits of the allocated page frames. It is
always set to 0x63, which
corresponds to Present,
Accessed, Read/Write, and Dirty.
pages
The address of a variable pointing to an array of pointers
to page descriptors (thus, struct page
*** is used as the data type!).
The function starts by assigning the linear addresses of the
start and end of the area to the address and end local variables, respectively:
address = area->addr; end = address + (area->size - PAGE_SIZE);
Remember that area->size
stores the actual size of the area plus the 4 KB inter-area safety
interval. The function then uses the pgd_offset_k macro to derive the entry in
the master kernel Page Global Directory related to the initial linear
address of the area; it then acquires the kernel Page Table spin
lock:
pgd = pgd_offset_k(address); spin_lock(&init_mm.page_table_lock);
The function then executes the following cycle:
int ret = 0;
for (i = pgd_index(address); i < pgd_index(end-1); i++) {
pud_t *pud = pud_alloc(&init_mm, pgd, address);
ret = -ENOMEM;
if (!pud)
break;
next = (address + PGDIR_SIZE) & PGDIR_MASK;
if (next < address || next > end)
next = end;
if (map_area_pud(pud, address, next, prot, pages))
break;
address = next;
pgd++;
ret = 0;
}
spin_unlock(&init_mm.page_table_lock);
flush_cache_vmap((unsigned long)area->addr, end);
return ret;
In each cycle, it first invokes pud_alloc( ) to create a Page Upper
Directory for the new area and writes its physical address in the
right entry of the kernel Page Global Directory. It then calls
map_area_pud( ) to allocate all the
page tables associated with the new Page Upper Directory. It adds the
size of the range of linear addresses spanned by a single Page Upper
Directory—the constant 2^30 if PAE is
enabled, 2^22 otherwise—to the current value
of address, and it increases the
pointer pgd to the Page Global
Directory.
The cycle is repeated until all Page Table entries referring to the noncontiguous memory area are set up.
The map_area_pud( ) function
executes a similar cycle for all the page tables that a Page Upper
Directory points to:
do {
pmd_t * pmd = pmd_alloc(&init_mm, pud, address);
if (!pmd)
return -ENOMEM;
if (map_area_pmd(pmd, address, end-address, prot, pages))
return -ENOMEM;
address = (address + PUD_SIZE) & PUD_MASK;
pud++;
} while (address < end);
The map_area_pmd( ) function
executes a similar cycle for all the Page Tables that a Page Middle
Directory points to:
do {
pte_t * pte = pte_alloc_kernel(&init_mm, pmd, address);
if (!pte)
return -ENOMEM;
if (map_area_pte(pte, address, end-address, prot, pages))
return -ENOMEM;
address = (address + PMD_SIZE) & PMD_MASK;
pmd++;
} while (address < end);
The pte_alloc_kernel( )
function (see the section "Page Table Handling" in
Chapter 2) allocates a new
Page Table and updates the corresponding entry in the Page Middle
Directory. Next, map_area_pte( )
allocates all the page frames corresponding to the entries in the Page
Table. The value of address is
increased by 2^22—the size of the linear
address interval spanned by a single Page Table—and the cycle is
repeated.
The main cycle of map_area_pte( ) is:
do {
struct page * page = **pages;
set_pte(pte, mk_pte(page, prot));
address += PAGE_SIZE;
pte++;
(*pages)++;
} while (address < end);
The page descriptor address page of the page frame to be mapped is read
from the array's entry pointed to by the variable at address pages. The physical address of the new page
frame is written into the Page Table by the set_pte and mk_pte macros. The cycle is repeated after
adding the constant 4,096 (the length of a page frame) to address.
Notice that the Page Tables of the current process are not
touched by map_vm_area( ).
Therefore, when a process in Kernel Mode accesses the noncontiguous
memory area, a Page Fault occurs, because the entries in the process's
Page Tables corresponding to the area are null. However, the Page
Fault handler checks the faulty linear address against the master
kernel Page Tables (which are init_mm.pgd Page Global Directory and its
child page tables; see the section "Kernel Page Tables" in
Chapter 2). Once the handler
discovers that a master kernel Page Table includes a non-null entry
for the address, it copies its value into the corresponding process's
Page Table entry and resumes normal execution of the process. This
mechanism is described in the section "Page Fault Exception
Handler" in Chapter
9.
Beside the vmalloc( )
function, a noncontiguous memory area can be allocated by the vmalloc_32( ) function, which is very
similar to vmalloc( ) but only
allocates page frames from the ZONE_NORMAL and ZONE_DMA memory zones.
Linux 2.6 also features a vmap(
) function, which maps page frames already allocated in a
noncontiguous memory area: essentially, this function receives as its
parameter an array of pointers to page descriptors, invokes get_vm_area( ) to get a new vm_struct descriptor, and then invokes
map_vm_area( ) to map the page
frames. The function is thus similar to vmalloc( ), but it does not allocate page
frames.
The vfree( ) function
releases noncontiguous memory areas created by vmalloc( ) or vmalloc_32( ), while the vunmap( ) function releases memory areas
created by vmap( ). Both functions
have one parameter—the address of the initial linear address of the
area to be released; they both rely on the _
_vunmap( ) function to do the real work.
The __vunmap( ) function
receives two parameters: the address addr of the initial linear address of the
area to be released, and the flag deallocate_pages, which is set if the page
frames mapped in the area should be released to the zoned page frame
allocator (vfree( )'s invocation),
and cleared otherwise (vunmap( )'s
invocation). The function performs the following operations:
Invokes the remove_vm_area(
) function to get the address area of the vm_struct descriptor and to clear the
kernel's page table entries corresponding to the linear address in
the noncontiguous memory area.
If the deallocate_pages
flag is set, it scans the area->pages array of pointers to the
page descriptor; for each element of the array, invokes the
__free_page( ) function to
release the page frame to the zoned page frame allocator.
Moreover, executes kfree(area->pages) to release the
array itself.
Invokes kfree(area) to
release the vm_struct
descriptor.
The remove_vm_area( )
function performs the following cycle:
write_lock(&vmlist_lock);
for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
if (tmp->addr == addr) {
unmap_vm_area(tmp);
*p = tmp->next;
break;
}
}
write_unlock(&vmlist_lock);
return tmp;
The area itself is released by invoking unmap_vm_area( ). This function acts on a
single parameter, namely a pointer area to the vm_struct descriptor of the area. It
executes the following cycle to reverse the actions performed by
map_vm_area( ):
address = area->addr;
end = address + area->size;
pgd = pgd_offset_k(address);
for (i = pgd_index(address); i <= pgd_index(end-1); i++) {
next = (address + PGDIR_SIZE) & PGDIR_MASK;
if (next <= address || next > end)
next = end;
unmap_area_pud(pgd, address, next - address);
address = next;
pgd++;
}
In turn, unmap_area_pud( )
reverses the actions of map_area_pud(
) in the cycle:
do {
unmap_area_pmd(pud, address, end-address);
address = (address + PUD_SIZE) & PUD_MASK;
pud++;
} while (address && (address < end));
The unmap_area_pmd( )
function reverses the actions of map_area_pmd( ) in the cycle:
do {
unmap_area_pte(pmd, address, end-address);
address = (address + PMD_SIZE) & PMD_MASK;
pmd++;
} while (address < end);
Finally, unmap_area_pte( )
reverses the actions of map_area_pte(
) in the cycle:
do {
pte_t page = ptep_get_and_clear(pte);
address += PAGE_SIZE;
pte++;
if (!pte_none(page) && !pte_present(page))
printk("Whee... Swapped out page in kernel page table\n");
} while (address < end);
In every iteration of the cycle, the page table entry pointed to
by pte is set to 0 by the ptep_get_and_clear macro.
As for vmalloc( ), the kernel
modifies the entries of the master kernel Page Global Directory and
its child page tables (see the section "Kernel Page Tables" in
Chapter 2), but it leaves
unchanged the entries of the process page tables mapping the fourth
gigabyte. This is fine because the kernel never reclaims Page Upper
Directories, Page Middle Directories, and Page Tables rooted at the
master kernel Page Global Directory.
For instance, suppose that a process in Kernel Mode accessed a
noncontiguous memory area that later got released. The process's Page
Global Directory entries are equal to the corresponding entries of the
master kernel Page Global Directory, thanks to the mechanism explained
in the section "Page Fault
Exception Handler" in Chapter 9; they point to the same
Page Upper Directories, Page Middle Directories, and Page Tables. The
unmap_area_pte( ) function clears
only the entries of the page tables (without reclaiming the page
tables themselves). Further accesses of the process to the released
noncontiguous memory area will trigger Page Faults because of the null
page table entries. However, the handler will consider such accesses a
bug, because the master kernel page tables do not include valid entries.
As seen in the previous chapter, a kernel function gets dynamic
memory in a fairly straightforward manner by invoking one of a variety of
functions: __get_free_pages( ) or
alloc_pages( ) to get pages from the
zoned page frame allocator, kmem_cache_alloc(
) or kmalloc( ) to use the
slab allocator for specialized or general-purpose objects, and vmalloc( ) or vmalloc_32( ) to get a noncontiguous memory
area. If the request can be satisfied, each of these functions returns a
page descriptor address or a linear address identifying the beginning of
the allocated dynamic memory area.
These simple approaches work for two reasons:
The kernel is the highest-priority component of the operating system. If a kernel function makes a request for dynamic memory, it must have a valid reason to issue that request, and there is no point in trying to defer it.
The kernel trusts itself. All kernel functions are assumed to be error-free, so the kernel does not need to insert any protection against programming errors.
When allocating memory to User Mode processes, the situation is entirely different:
Process requests for dynamic memory are considered non-urgent.
When a process's executable file is loaded, for instance, it is
unlikely that the process will address all the pages of code in the
near future. Similarly, when a process invokes malloc( ) to get additional dynamic memory,
it doesn't mean the process will soon access all the additional memory
obtained. Thus, as a general rule, the kernel tries to defer
allocating dynamic memory to User Mode processes.
Because user programs cannot be trusted, the kernel must be prepared to catch all addressing errors caused by processes in User Mode.
As this chapter describes, the kernel succeeds in deferring the allocation of dynamic memory to processes by using a new kind of resource. When a User Mode process asks for dynamic memory, it doesn't get additional page frames; instead, it gets the right to use a new range of linear addresses, which become part of its address space. This interval is called a "memory region."
In the next section, we discuss how the process views dynamic memory. We then describe the basic components of the process address space in the section "Memory Regions." Next, we examine in detail the role played by the Page Fault exception handler in deferring the allocation of page frames to processes and illustrate how the kernel creates and deletes whole process address spaces. Last, we discuss the APIs and system calls related to address space management.
The address space of a process consists of all linear addresses that the process is allowed to use. Each process sees a different set of linear addresses; the address used by one process bears no relation to the address used by another. As we will see later, the kernel may dynamically modify a process address space by adding or removing intervals of linear addresses.
The kernel represents intervals of linear addresses by means of resources called memory regions , which are characterized by an initial linear address, a length, and some access rights. For reasons of efficiency, both the initial address and the length of a memory region must be multiples of 4,096, so that the data identified by each memory region completely fills up the page frames allocated to it. Following are some typical situations in which a process gets new memory regions:
When the user types a command at the console, the shell process creates a new process to execute the command. As a result, a fresh address space, and thus a set of memory regions, is assigned to the new process (see the section "Creating and Deleting a Process Address Space" later in this chapter; also, see Chapter 20).
A running process may decide to load an entirely different program. In this case, the process ID remains unchanged, but the memory regions used before loading the program are released and a new set of memory regions is assigned to the process (see the section "The exec Functions" in Chapter 20).
A running process may perform a "memory mapping" on a file (or on a portion of it). In such cases, the kernel assigns a new memory region to the process to map the file (see the section "Memory Mapping" in Chapter 16).
A process may keep adding data on its User Mode stack until all addresses in the memory region that map the stack have been used. In this case, the kernel may decide to expand the size of that memory region (see the section "Page Fault Exception Handler" later in this chapter).
A process may create an IPC-shared memory region to share data with other cooperating processes. In this case, the kernel assigns a new memory region to the process to implement this construct (see the section "IPC Shared Memory" in Chapter 19).
A process may expand its dynamic area (the heap) through a
function such as malloc( ). As a
result, the kernel may decide to expand the size of the memory
region assigned to the heap (see the section "Managing the Heap" later
in this chapter).
Table 9-1
illustrates some of the system calls related to the previously mentioned
tasks. brk( ) is discussed at the end
of this chapter, while the remaining system calls are described in other
chapters.
Table 9-1. System calls related to memory region creation and deletion
| System call | Description |
|---|---|
| brk( ) | Changes the heap size of the process |
| execve( ) | Loads a new executable file, thus changing the process address space |
| _exit( ) | Terminates the current process and destroys its address space |
| fork( ) | Creates a new process, and thus a new address space |
| mmap( ), mmap2( ) | Creates a memory mapping for a file, thus enlarging the process address space |
| mremap( ) | Expands or shrinks a memory region |
| remap_file_pages( ) | Creates a non-linear mapping for a file (see Chapter 16) |
| munmap( ) | Destroys a memory mapping for a file, thus contracting the process address space |
| shmat( ) | Attaches a shared memory region |
| shmdt( ) | Detaches a shared memory region |
As we'll see in the later section "Page Fault Exception Handler," it is essential for the kernel to identify the memory regions currently owned by a process (the address space of a process), because that allows the Page Fault exception handler to efficiently distinguish between two types of invalid linear addresses that cause it to be invoked:
Those caused by programming errors.
Those caused by a missing page; even though the linear address belongs to the process's address space, the page frame corresponding to that address has yet to be allocated.
The latter addresses are not invalid from the process's point of view; the induced Page Faults are exploited by the kernel to implement demand paging : the kernel provides the missing page frame and lets the process continue.
All information related to the process address space is
included in an object called the memory descriptor
of type mm_struct. This object is
referenced by the mm field of the
process descriptor. The fields of a memory descriptor are listed in
Table 9-2.
Table 9-2. The fields of the memory descriptor
| Type | Field | Description |
|---|---|---|
| struct vm_area_struct * | mmap | Pointer to the head of the list of memory region objects |
| struct rb_root | mm_rb | Pointer to the root of the red-black tree of memory region objects |
| struct vm_area_struct * | mmap_cache | Pointer to the last referenced memory region object |
| unsigned long (*)( ) | get_unmapped_area | Method that searches an available linear address interval in the process address space |
| void (*)( ) | unmap_area | Method invoked when releasing a linear address interval |
| unsigned long | mmap_base | Identifies the linear address of the first allocated anonymous memory region or file memory mapping (see the section "Program Segments and Process Memory Regions" in Chapter 20) |
| unsigned long | free_area_cache | Address from which the kernel will look for a free interval of linear addresses in the process address space |
| pgd_t * | pgd | Pointer to the Page Global Directory |
| atomic_t | mm_users | Secondary usage counter |
| atomic_t | mm_count | Main usage counter |
| int | map_count | Number of memory regions |
| struct rw_semaphore | mmap_sem | Memory regions' read/write semaphore |
| spinlock_t | page_table_lock | Memory regions' and Page Tables' spin lock |
| struct list_head | mmlist | Pointers to adjacent elements in the list of memory descriptors |
| unsigned long | start_code | Initial address of executable code |
| unsigned long | end_code | Final address of executable code |
| unsigned long | start_data | Initial address of initialized data |
| unsigned long | end_data | Final address of initialized data |
| unsigned long | start_brk | Initial address of the heap |
| unsigned long | brk | Current final address of the heap |
| unsigned long | start_stack | Initial address of User Mode stack |
| unsigned long | arg_start | Initial address of command-line arguments |
| unsigned long | arg_end | Final address of command-line arguments |
| unsigned long | env_start | Initial address of environment variables |
| unsigned long | env_end | Final address of environment variables |
| unsigned long | rss | Number of page frames allocated to the process |
| unsigned long | anon_rss | Number of page frames assigned to anonymous memory mappings |
| unsigned long | total_vm | Size of the process address space (number of pages) |
| unsigned long | locked_vm | Number of "locked" pages that cannot be swapped out (see Chapter 17) |
| unsigned long | shared_vm | Number of pages in shared file memory mappings |
| unsigned long | exec_vm | Number of pages in executable memory mappings |
| unsigned long | stack_vm | Number of pages in the User Mode stack |
| unsigned long | reserved_vm | Number of pages in reserved or special memory regions |
| unsigned long | def_flags | Default access flags of the memory regions |
| unsigned long | nr_ptes | Number of Page Tables of this process |
| unsigned long [ ] | saved_auxv | Used when starting the execution of an ELF program (see Chapter 20) |
| unsigned int | dumpable | Flag that specifies whether the process can produce a core dump of the memory |
| cpumask_t | cpu_vm_mask | Bit mask for lazy TLB switches (see Chapter 2) |
| mm_context_t | context | Pointer to table for architecture-specific information (e.g., LDT's address in 80 × 86 platforms) |
| unsigned long | swap_token_time | When this process will become eligible for having the swap token (see the section "The Swap Token" in Chapter 17) |
| char | recent_pagein | Flag set if a major Page Fault has recently occurred |
| int | core_waiters | Number of lightweight processes that are dumping the contents of the process address space to a core file (see the section "Deleting a Process Address Space" later in this chapter) |
| struct completion * | core_startup_done | Pointer to a completion used when creating a core file (see the section "Completions" in Chapter 5) |
| struct completion | core_done | Completion used when creating a core file |
| rwlock_t | ioctx_list_lock | Lock used to protect the list of asynchronous I/O contexts (see Chapter 16) |
| struct kioctx * | ioctx_list | List of asynchronous I/O contexts (see Chapter 16) |
| struct kioctx | default_kioctx | Default asynchronous I/O context (see Chapter 16) |
| unsigned long | hiwater_rss | Maximum number of page frames ever owned by the process |
| unsigned long | hiwater_vm | Maximum number of pages ever included in the memory regions of the process |
All memory descriptors are stored in a doubly linked list. Each
descriptor stores the address of the adjacent list items in the mmlist field. The first element of the list is
the mmlist field of init_mm, the memory descriptor used by process
0 in the initialization phase. The list is protected against concurrent
accesses in multiprocessor systems by the mmlist_lock spin lock.
The mm_users field stores the
number of lightweight processes that share the mm_struct data structure (see the section
"The clone( ), fork( ), and
vfork( ) System Calls" in Chapter 3). The mm_count field is the main usage counter of
the memory descriptor; all "users" in mm_users count as one unit in mm_count. Every time the mm_count field is decreased, the kernel checks
whether it becomes zero; if so, the memory descriptor is deallocated
because it is no longer in use.
We'll try to explain the difference between the use of mm_users and mm_count with an example. Consider a memory
descriptor shared by two lightweight processes. Normally, its mm_users field stores the value 2, while its
mm_count field stores the value 1
(both owner processes count as one).
If the memory descriptor is temporarily lent to a kernel thread
(see the next section), the kernel increases the mm_count field. In this way, even if both
lightweight processes die and the mm_users field becomes zero, the memory
descriptor is not released until the kernel thread finishes using it
because the mm_count field remains
greater than zero.
If the kernel wants to be sure that the memory descriptor is not
released in the middle of a lengthy operation, it might increase the
mm_users field instead of mm_count (this is what the try_to_unuse( ) function does; see the section
"Activating and
Deactivating a Swap Area" in Chapter 17). The final result is
the same because the increment of mm_users ensures that mm_count does not become zero even if all
lightweight processes that own the memory descriptor die.
The mm_alloc( ) function is
invoked to get a new memory descriptor. Because these descriptors are
stored in a slab allocator cache, mm_alloc(
) calls kmem_cache_alloc(
), initializes the new memory descriptor, and sets the
mm_count and mm_users fields to 1.
Conversely, the mmput( )
function decreases the mm_users field
of a memory descriptor. If that field becomes 0, the function releases
the Local Descriptor Table, the memory region descriptors (see later in
this chapter), and the Page Tables referenced by the memory descriptor,
and then invokes mmdrop( ). The
latter function decreases mm_count
and, if it becomes zero, releases the mm_struct data structure.
The mmap, mm_rb, mmlist, and mmap_cache fields are discussed in the next
section.
Kernel threads run only in Kernel Mode, so they never
access linear addresses below TASK_SIZE (same as PAGE_OFFSET, usually 0xc0000000). Contrary to regular processes,
kernel threads do not use memory regions, therefore most of the fields
of a memory descriptor are meaningless for them.
Because the Page Table entries that refer to the linear address
above TASK_SIZE should always be
identical, it does not really matter what set of Page Tables a kernel
thread uses. To avoid useless TLB and cache flushes, a kernel thread
uses the set of Page Tables of the last previously running regular
process. To that end, two kinds of memory descriptor pointers are
included in every process descriptor: mm and active_mm.
The mm field in the process
descriptor points to the memory descriptor owned by the process, while
the active_mm field points to the
memory descriptor used by the process when it is in execution. For
regular processes, the two fields store the same pointer. Kernel
threads, however, do not own any memory descriptor, thus their
mm field is always NULL. When a kernel thread is selected for
execution, its active_mm field is
initialized to the value of the active_mm of the previously running process
(see the section "The
schedule( ) Function" in Chapter 7).
There is, however, a small complication. Whenever a process in
Kernel Mode modifies a Page Table entry for a "high" linear address
(above TASK_SIZE), it should also
update the corresponding entry in the sets of Page Tables of all
processes in the system. In fact, once set by a process in Kernel
Mode, the mapping should be effective for all other processes in
Kernel Mode as well. Touching the sets of Page Tables of all processes
is a costly operation; therefore, Linux adopts a deferred
approach.
We already mentioned this deferred approach in the section
"Noncontiguous Memory Area
Management" in Chapter
8: every time a high linear address has to be remapped
(typically by vmalloc( ) or
vfree( )), the kernel updates a
canonical set of Page Tables rooted at the swapper_pg_dir master kernel Page Global
Directory (see the section "Kernel Page Tables" in
Chapter 2). This Page Global
Directory is pointed to by the pgd
field of a master memory descriptor , which is stored in the init_mm variable.[*]
Later, in the section "Handling Noncontiguous Memory Area Accesses," we'll describe how the Page Fault handler takes care of spreading the information stored in the canonical Page Tables when effectively needed.
[*] We mentioned in the section "Kernel Threads" in
Chapter 3 that the
swapper process uses init_mm during the initialization phase.
However, swapper never uses this memory descriptor once the
initialization phase completes.
Linux implements a memory region by means of an object of
type vm_area_struct; its fields are
shown in Table
9-3.[*]
Table 9-3. The fields of the memory region object
| Type | Field | Description |
|---|---|---|
| struct mm_struct * | vm_mm | Pointer to the memory descriptor that owns the region. |
| unsigned long | vm_start | First linear address inside the region. |
| unsigned long | vm_end | First linear address after the region. |
| struct vm_area_struct * | vm_next | Next region in the process list. |
| pgprot_t | vm_page_prot | Access permissions for the page frames of the region. |
| unsigned long | vm_flags | Flags of the region. |
| struct rb_node | vm_rb | Data for the red-black tree (see later in this chapter). |
| union | shared | Links to the data structures used for reverse mapping (see the section "Reverse Mapping for Mapped Pages" in Chapter 17). |
| struct list_head | anon_vma_node | Pointers for the list of anonymous memory regions (see the section "Reverse Mapping for Anonymous Pages" in Chapter 17). |
| struct anon_vma * | anon_vma | Pointer to the anon_vma data structure (see the section "Reverse Mapping for Anonymous Pages" in Chapter 17). |
| struct vm_operations_struct * | vm_ops | Pointer to the methods of the memory region. |
| unsigned long | vm_pgoff | Offset in mapped file (see Chapter 16). For anonymous pages, it is either zero or equal to vm_start/PAGE_SIZE (see Chapter 17). |
| struct file * | vm_file | Pointer to the file object of the mapped file, if any. |
| void * | vm_private_data | Pointer to private data of the memory region. |
| unsigned long | vm_truncate_count | Used when releasing a linear address interval in a non-linear file memory mapping. |
Each memory region descriptor identifies a linear address
interval. The vm_start field contains
the first linear address of the interval, while the vm_end field contains the first linear address
outside of the interval; vm_end-vm_start thus denotes the length of the
memory region. The vm_mm field points
to the mm_struct memory descriptor of
the process that owns the region. We will describe the other fields of
vm_area_struct as they come
up.
Memory regions owned by a process never overlap, and the kernel tries to merge regions when a new one is allocated right next to an existing one. Two adjacent regions can be merged if their access rights match.
As shown in Figure 9-1, when a new range of linear addresses is added to the process address space, the kernel checks whether an already existing memory region can be enlarged (case a). If not, a new memory region is created (case b). Similarly, if a range of linear addresses is removed from the process address space, the kernel resizes the affected memory regions (case c). In some cases, the resizing forces a memory region to split into two smaller ones (case d).[*]
The vm_ops field points to a
vm_operations_struct data structure,
which stores the methods of the memory region. Only four
methods—illustrated in Table
9-4—are applicable to UMA systems.
Table 9-4. The methods to act on a memory region
| Method | Description |
|---|---|
| open | Invoked when the memory region is added to the set of regions owned by a process. |
| close | Invoked when the memory region is removed from the set of regions owned by a process. |
| nopage | Invoked by the Page Fault exception handler when a process tries to access a page not present in RAM whose linear address belongs to the memory region (see the later section "Page Fault Exception Handler"). |
| populate | Invoked to set the page table entries corresponding to the linear addresses of the memory region (prefaulting). Mainly used for non-linear file memory mappings. |
All the regions owned by a process are linked in a
simple list. Regions appear in the list in ascending order by memory
address; however, successive regions can be separated by an area of
unused memory addresses. The vm_next field of each vm_area_struct element points to the next
element in the list. The kernel finds the memory regions through the
mmap field of the process memory
descriptor, which points to the first memory region descriptor in the
list.
The map_count field of the
memory descriptor contains the number of regions owned by the process.
By default, a process may own up to 65,536 different memory regions;
however, the system administrator may change this limit by writing in
the /proc/sys/vm/max_map_count
file.
Figure 9-2 illustrates the relationships among the address space of a process, its memory descriptor, and the list of memory regions.
A frequent operation performed by the kernel is to search the memory region that includes a specific linear address. Because the list is sorted, the search can terminate as soon as a memory region that ends after the specific linear address is found.
However, using the list is convenient only if the process has very few memory regions—let's say less than a few tens of them. Searching, inserting elements, and deleting elements in the list involve a number of operations whose times are linearly proportional to the list length.
Although most Linux processes use very few memory regions, there
are some large applications, such as object-oriented databases or
specialized debuggers for the usage of malloc(), that have many hundreds or even
thousands of regions. In such cases, the memory region list management
becomes very inefficient, hence the performance of the memory-related
system calls degrades to an intolerable point.
Therefore, Linux 2.6 stores memory descriptors in data structures called red-black trees. In a red-black tree, each element (or node) usually has two children: a left child and a right child. The elements in the tree are sorted. For each node N, all elements of the subtree rooted at the left child of N precede N, while, conversely, all elements of the subtree rooted at the right child of N follow N (see Figure 9-3(a); the key of the node is written inside the node itself). Moreover, a red-black tree must satisfy four additional rules:
Every node must be either red or black.
The root of the tree must be black.
The children of a red node must be black.
Every path from a node to a descendant leaf must contain the same number of black nodes. When counting the number of black nodes, null pointers are counted as black nodes.
These four rules ensure that every red-black tree with n internal nodes has a height of at most 2 × log(n + 1).
Searching an element in a red-black tree is thus very efficient, because it requires operations whose execution time is linearly proportional to the logarithm of the tree size. In other words, doubling the number of memory regions adds just one more iteration to the operation.
Inserting and deleting an element in a red-black tree is also efficient, because the algorithm can quickly traverse the tree to locate the position at which the element will be inserted or from which it will be removed. Each new node must be inserted as a leaf and colored red. If the operation breaks the rules, a few nodes of the tree must be moved or recolored.
For instance, suppose that an element having the value 4 must be inserted in the red-black tree shown in Figure 9-3(a). Its proper position is the right child of the node that has key 3, but once it is inserted, the red node that has the value 3 has a red child, thus breaking rule 3. To satisfy the rule, the color of nodes that have the values 3, 4, and 7 is changed. This operation, however, breaks rule 4, thus the algorithm performs a "rotation" on the subtree rooted at the node that has the key 19, producing the new red-black tree shown in Figure 9-3(b). This looks complicated, but inserting or deleting an element in a red-black tree requires a small number of operations—a number linearly proportional to the logarithm of the tree size.
Therefore, to store the memory regions of a process, Linux uses both a linked list and a red-black tree. Both data structures contain pointers to the same memory region descriptors. When inserting or removing a memory region descriptor, the kernel searches the previous and next elements through the red-black tree and uses them to quickly update the list without scanning it.
The head of the linked list is referenced by the mmap field of the memory descriptor. Each
memory region object stores the pointer to the next element of the
list in the vm_next field. The head
of the red-black tree is referenced by the mm_rb field of the memory descriptor. Each
memory region object stores the color of the node, as well as the
pointers to the parent, the left child, and the right child, in the
vm_rb field of type rb_node.
In general, the red-black tree is used to locate a region including a specific address, while the linked list is mostly useful when scanning the whole set of regions.
Before moving on, we should clarify the relation between a page and a memory region. As mentioned in Chapter 2, we use the term "page" to refer both to a set of linear addresses and to the data contained in this group of addresses. In particular, we denote the linear address interval ranging between 0 and 4,095 as page 0, the linear address interval ranging between 4,096 and 8,191 as page 1, and so forth. Each memory region therefore consists of a set of pages that have consecutive page numbers.
We have already discussed two kinds of flags associated with a page:
A few flags such as Read/Write, Present, or User/Supervisor stored in each Page
Table entry (see the section "Regular Paging" in
Chapter 2).
A set of flags stored in the flags field of each page descriptor (see the section "Page Frame Management"
in Chapter 8).
The first kind of flag is used by the 80 × 86 hardware to check whether the requested kind of addressing can be performed; the second kind is used by Linux for many different purposes (see Table 8-2).
We now introduce a third kind of flag: those associated with the
pages of a memory region. They are stored in the vm_flags field of the vm_area_struct descriptor (see Table 9-5). Some
flags offer the kernel information about all the pages of the
memory region, such as what they contain and what rights the process
has to access each page. Other flags describe the region itself, such
as how it can grow.
Table 9-5. The memory region flags

| Flag name | Description |
|---|---|
| VM_READ | Pages can be read |
| VM_WRITE | Pages can be written |
| VM_EXEC | Pages can be executed |
| VM_SHARED | Pages can be shared by several processes |
| VM_MAYREAD | VM_READ flag may be set |
| VM_MAYWRITE | VM_WRITE flag may be set |
| VM_MAYEXEC | VM_EXEC flag may be set |
| VM_MAYSHARE | VM_SHARED flag may be set |
| VM_GROWSDOWN | The region can expand toward lower addresses |
| VM_GROWSUP | The region can expand toward higher addresses |
| VM_SHM | The region is used for IPC's shared memory |
| VM_DENYWRITE | The region maps a file that cannot be opened for writing |
| VM_EXECUTABLE | The region maps an executable file |
| VM_LOCKED | Pages in the region are locked and cannot be swapped out |
| VM_IO | The region maps the I/O address space of a device |
| VM_SEQ_READ | The application accesses the pages sequentially |
| VM_RAND_READ | The application accesses the pages in a truly random order |
| VM_DONTCOPY | Do not copy the region when forking a new process |
| VM_DONTEXPAND | Forbid region expansion through the mremap( ) system call |
| VM_RESERVED | The region is special (for instance, it maps the I/O address space of a device), so its pages must not be swapped out |
| VM_ACCOUNT | Check whether there is enough free memory for the mapping when creating an IPC shared memory region (see Chapter 19) |
| VM_HUGETLB | The pages in the region are handled through the extended paging mechanism (see the section "Extended Paging" in Chapter 2) |
| VM_NONLINEAR | The region implements a non-linear file mapping |
Page access rights included in a memory region descriptor may be combined arbitrarily. It is possible, for instance, to allow the pages of a region to be read but not executed. To implement this protection scheme efficiently, the Read, Write, and Execute access rights associated with the pages of a memory region must be duplicated in all the corresponding Page Table entries, so that checks can be directly performed by the Paging Unit circuitry. In other words, the page access rights dictate what kinds of access should generate a Page Fault exception. As we'll see shortly, the job of figuring out what caused the Page Fault is delegated by Linux to the Page Fault handler, which implements several page-handling strategies.
The initial values of the Page Table flags (which must be the
same for all pages in the memory region, as we have seen) are stored
in the vm_page_prot field of the
vm_area_struct descriptor. When
adding a page, the kernel sets the flags in the corresponding Page
Table entry according to the value of the vm_page_prot field.
However, translating the memory region's access rights into the page protection bits is not straightforward for the following reasons:
In some cases, a page access should generate a Page Fault
exception even when its access type is granted by the page access
rights specified in the vm_flags field of the corresponding
memory region. For instance, as we'll see in the section "Copy On Write" later
in this chapter, the kernel may wish to store two identical,
writable private pages (whose VM_SHARED flags are cleared) belonging to
two different processes in the same page frame; in this case, an
exception should be generated when either one of the processes
tries to modify the page.
As mentioned in Chapter
2, 80 × 86 processors' Page Tables have just two
protection bits, namely the Read/Write and User/Supervisor flags. Moreover, the
User/Supervisor flag of every
page included in a memory region must always be set, because the
page must always be accessible by User Mode processes.
Recent Intel Pentium 4 microprocessors with PAE enabled
sport a NX (No eXecute) flag in
each 64-bit Page Table entry.
If the kernel has been compiled without support for PAE, Linux adopts the following rules, which overcome the hardware limitation of the 80 × 86 microprocessors:
The Read access right always implies the Execute access right, and vice versa.
The Write access right always implies the Read access right.
Conversely, if the kernel has been compiled with support for PAE
and the CPU has the NX flag, Linux
adopts different rules:
The Execute access right always implies the Read access right.
The Write access right always implies the Read access right.
Moreover, to correctly defer the allocation of page frames through the "Copy On Write" technique (see later in this chapter), the page frame is write-protected whenever the corresponding page must not be shared by several processes.
Therefore, the 16 possible combinations of the Read, Write, Execute, and Share access rights are scaled down according to the following rules:
If the page has both Write and Share access rights, the
Read/Write bit is set.
If the page has the Read or Execute access right but does
not have either the Write or the Share access right, the Read/Write bit is cleared.
If the NX bit is
supported and the page does not have the Execute access right, the
NX bit is set.
If the page does not have any access rights, the Present bit is cleared so that each
access generates a Page Fault exception. However, to distinguish
this condition from the real page-not-present case, Linux also
sets the Page size bit to
1.[*]
The downscaled protection bits corresponding to each combination
of access rights are stored in the 16 elements of the protection_map array.
Having a basic understanding of the data structures and
state information that control memory handling, we can look at a group of low-level functions that
operate on memory region descriptors. They should be considered
auxiliary functions that simplify the implementation of do_mmap( ) and do_munmap( ). Those two functions, which are
described in the sections "Allocating a Linear Address
Interval" and "Releasing a Linear Address
Interval" later in this chapter, enlarge and shrink the address
space of a process, respectively. Working at a higher level than the
functions we consider here, they do not receive a memory region
descriptor as their parameter, but rather the initial address, the
length, and the access rights of a linear address interval.
The find_vma( ) function
acts on two parameters: the address mm of a process memory descriptor and a
linear address addr. It locates
the first memory region whose vm_end field is greater than addr and returns the address of its
descriptor; if no such region exists, it returns a NULL pointer. Notice that the region
selected by find_vma( ) does not
necessarily include addr because
addr may lie outside of any
memory region.
Each memory descriptor includes an mmap_cache field that stores the descriptor address of the
region that was last referenced by the process. This additional
field is introduced to reduce the time spent in looking for the
region that contains a given linear address. Locality of address
references in programs makes it highly likely that if the last
linear address checked belonged to a given region, the next one to
be checked belongs to the same region.
The function thus starts by checking whether the region
identified by mmap_cache includes
addr. If so, it returns the
region descriptor pointer:
vma = mm->mmap_cache;
if (vma && vma->vm_end > addr && vma->vm_start <= addr)
    return vma;
Otherwise, the memory regions of the process must be scanned, and the function looks up the memory region in the red-black tree:
rb_node = mm->mm_rb.rb_node;
vma = NULL;
while (rb_node) {
    vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
    if (vma_tmp->vm_end > addr) {
        vma = vma_tmp;
        if (vma_tmp->vm_start <= addr)
            break;
        rb_node = rb_node->rb_left;
    } else
        rb_node = rb_node->rb_right;
}
if (vma)
    mm->mmap_cache = vma;
return vma;
The function uses the rb_entry macro, which derives from a
pointer to a node of the red-black tree the address of the
corresponding memory region descriptor.
The find_vma_prev( )
function is similar to find_vma(
), except that it writes in an additional pprev parameter a pointer to the
descriptor of the memory region that precedes the one selected by
the function.
Finally, the find_vma_prepare(
) function locates the position of the new leaf in the
red-black tree that corresponds to a given linear address and
returns the addresses of the preceding memory region and of the
parent node of the leaf to be inserted.
The find_vma_intersection(
) function finds the first memory region that overlaps a
given linear address interval; the mm parameter points to the memory
descriptor of the process, while the start_addr and end_addr linear addresses specify the
interval:
vma = find_vma(mm, start_addr);
if (vma && end_addr <= vma->vm_start)
    vma = NULL;
return vma;
The function returns a NULL
pointer if no such region exists. To be exact, if find_vma( ) returns a valid address but
the memory region found starts after the end of the linear address
interval, vma is set to NULL.
The get_unmapped_area(
) function searches the process address space to find an
available linear address interval. The len parameter specifies the interval
length, while a non-null addr
parameter specifies the address from which the search must be
started. If the search is successful, the function returns the
initial address of the new interval; otherwise, it returns the error
code -ENOMEM.
If the addr parameter is
not NULL, the function checks
that the specified address is in the User Mode address space and
that it is aligned to a page boundary. Next, the function invokes
either one of two methods, depending on whether the linear address
interval should be used for a file memory mapping or for an
anonymous memory mapping. In the former case, the function executes
the get_unmapped_area file
operation; this is discussed in Chapter 16.
In the latter case, the function executes the get_unmapped_area method of the memory
descriptor. In turn, this method is implemented by either the
arch_get_unmapped_area( )
function, or the arch_get_unmapped_area_topdown( )
function, according to the memory region layout of the process. As
we'll see in the section "Program Segments and Process
Memory Regions" in Chapter 20, every process can
have two different layouts for the memory regions allocated through
the mmap( ) system call: either they start from the linear
address 0x40000000 and grow
towards higher addresses, or they start right above the User Mode
stack and grow towards lower addresses.
Let us discuss the arch_get_unmapped_area( ) function, which
is used when the memory regions are allocated moving from lower
addresses to higher ones. It is essentially equivalent to the
following code fragment:
if (len > TASK_SIZE)
    return -ENOMEM;
addr = (addr + 0xfff) & 0xfffff000;
if (addr && addr + len <= TASK_SIZE) {
    vma = find_vma(current->mm, addr);
    if (!vma || addr + len <= vma->vm_start)
        return addr;
}
start_addr = addr = mm->free_area_cache;
for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
    if (addr + len > TASK_SIZE) {
        if (start_addr == ((TASK_SIZE/3+0xfff)&0xfffff000))
            return -ENOMEM;
        start_addr = addr = (TASK_SIZE/3+0xfff)&0xfffff000;
        vma = find_vma(current->mm, addr);
    }
    if (!vma || addr + len <= vma->vm_start) {
        mm->free_area_cache = addr + len;
        return addr;
    }
    addr = vma->vm_end;
}
The function starts by checking to make sure the interval
length is within TASK_SIZE, the
limit imposed on User Mode linear addresses (usually 3 GB). If
addr is different from zero, the
function tries to allocate the interval starting from addr. To be on the safe side, the function
rounds up the value of addr to a
multiple of 4 KB.
If addr is 0 or the
previous search failed, the arch_get_unmapped_area( ) function scans
the User Mode linear address space looking for a range of linear
addresses not included in any memory region and large enough to
contain the new region. To speed up the search, the search's
starting point is usually set to the linear address following the
last allocated memory region. The mm->free_area_cache field of the memory
descriptor is initialized to one-third of the User Mode linear
address space—usually, 1 GB—and then updated as new memory regions
are created. If the function fails in finding a suitable range of
linear addresses, the search restarts from the beginning—that is,
from one-third of the User Mode linear address space: in fact, the
first third of the User Mode linear address space is reserved for
memory regions having a predefined starting linear address,
typically the text, data, and bss segments of an executable file
(see Chapter 20).
The function invokes find_vma(
) to locate the first memory region ending after the
search's starting point, then repeatedly considers all the following
memory regions. Three cases may occur:
The requested interval is larger than the portion of
linear address space yet to be scanned (addr + len > TASK_SIZE): in this
case, the function either restarts from one-third of the User
Mode address space or, if the second search has already been
done, returns -ENOMEM (there
are not enough linear addresses to satisfy the request).
The hole following the last scanned region is not large
enough (vma != NULL &&
vma->vm_start < addr + len). In this case, the
function considers the next region.
If neither one of the preceding conditions holds, a large
enough hole has been found. In this case, the function returns
addr.
insert_vm_struct(
) inserts a vm_area_struct structure in the memory
region object list and red-black tree of a memory descriptor. It
uses two parameters: mm, which
specifies the address of a process memory descriptor, and vma, which specifies the address of the
vm_area_struct object to be
inserted. The vm_start and
vm_end fields of the memory
region object must have already been initialized. The function
invokes the find_vma_prepare( )
function to look up the position in the red-black tree mm->mm_rb where vma should go. Then insert_vm_struct( ) invokes the vma_link( ) function, which in
turn:
Inserts the memory region in the linked list referenced by
mm->mmap.
Inserts the memory region in the red-black tree mm->mm_rb.
If the memory region is anonymous, inserts the region in
the list headed at the corresponding anon_vma data structure (see the
section "Reverse
Mapping for Anonymous Pages" in Chapter 17).
Increases the mm->map_count counter.
If the region contains a memory-mapped file, the vma_link( ) function performs additional
tasks that are described in Chapter 17.
The __vma_unlink( )
function receives as its parameters a memory descriptor address
mm and two memory region object
addresses vma and prev. Both memory regions should belong to
mm, and prev should precede vma in the memory region ordering. The
function removes vma from the
linked list and the red-black tree of the memory descriptor. It also
updates mm->mmap_cache, which
stores the last referenced memory region, if this field points to
the memory region just deleted.
Now let's discuss how new linear address
intervals are allocated. To do this, the do_mmap( ) function creates and initializes
a new memory region for the current
process. However, after a successful allocation, the memory region
could be merged with other memory regions defined for the
process.
The function uses the following parameters:
file and offset
File object pointer file and file offset offset are used if the new memory
region will map a file into memory. This topic is discussed in
Chapter 16. In this
section, we assume that no memory mapping is required and that
file and offset are both NULL.
addr
This linear address specifies where the search for a free interval must start.
len
The length of the linear address interval.
prot
This parameter specifies the access rights of the pages
included in the memory region. Possible flags are PROT_READ, PROT_WRITE, PROT_EXEC, and PROT_NONE. The first three flags mean
the same things as the VM_READ, VM_WRITE, and VM_EXEC flags. PROT_NONE indicates that the process
has none of those access rights.
flag
This parameter specifies the remaining memory region flags:

MAP_GROWSDOWN, MAP_LOCKED, MAP_DENYWRITE, and MAP_EXECUTABLE
Their meanings are identical to those of the flags listed in Table 9-5.

MAP_SHARED and MAP_PRIVATE
The former flag specifies that the pages in the memory region can be shared among several processes; the latter flag has the opposite effect. Both flags refer to the VM_SHARED flag in the vm_area_struct descriptor.

MAP_FIXED
The initial linear address of the interval must be exactly the one specified in the addr parameter.

MAP_ANONYMOUS
No file is associated with the memory region (see Chapter 16).

MAP_NORESERVE
The function doesn't have to do a preliminary check on the number of free page frames.

MAP_POPULATE
The function should pre-allocate the page frames required for the mapping established by the memory region. This flag is significant only for memory regions that map files (see Chapter 16) and for IPC shared memory regions (see Chapter 19).

MAP_NONBLOCK
Significant only when the MAP_POPULATE flag is set: when pre-allocating the page frames, the function must not block.
The do_mmap( ) function
performs some preliminary checks on the value of offset and then executes the do_mmap_pgoff( ) function. In this chapter
we will suppose that the new interval of linear addresses does not map a
file on disk—file memory mapping is discussed in detail in Chapter 16. Here is a description
of the do_mmap_pgoff( ) function
for anonymous memory regions:
Checks whether the parameter values are correct and whether the request can be satisfied. In particular, it checks for the following conditions that prevent it from satisfying the request:
The linear address interval has zero length or includes
addresses greater than TASK_SIZE.
The process has already mapped too many memory
regions—that is, the value of the map_count field of its mm memory descriptor exceeds the
allowed maximum value.
The flag parameter
specifies that the pages of the new linear address interval
must be locked in RAM, but the process is not allowed to
create locked memory regions, or the number of pages locked by
the process exceeds the threshold stored in the signal->rlim[RLIMIT_MEMLOCK].rlim_cur
field of the process descriptor.
If any of the preceding conditions holds, do_mmap_pgoff( ) terminates by returning
a negative value. If the linear address interval has a zero
length, the function returns without performing any action.
Invokes get_unmapped_area(
) to obtain a linear address interval for the new region
(see the previous section "Memory Region
Handling").
Computes the flags of the new memory region by combining the
values stored in the prot and
flags parameters:
vm_flags = calc_vm_prot_bits(prot,flags) |
calc_vm_flag_bits(prot,flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
if (flags & MAP_SHARED)
    vm_flags |= VM_SHARED | VM_MAYSHARE;

The calc_vm_prot_bits( )
function sets the VM_READ,
VM_WRITE, and VM_EXEC flags in vm_flags only if the corresponding
PROT_READ, PROT_WRITE, and PROT_EXEC flags in prot are set. The calc_vm_flag_bits( ) function sets the
VM_GROWSDOWN, VM_DENYWRITE, VM_EXECUTABLE, and VM_LOCKED flags in vm_flags only if the corresponding
MAP_GROWSDOWN, MAP_DENYWRITE, MAP_EXECUTABLE, and MAP_LOCKED flags in flags are set. A few other flags are set
in vm_flags: VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, the default flags for all
memory regions in mm->def_flags,[*] and both VM_SHARED and VM_MAYSHARE if the pages of the memory
region have to be shared with other processes.
Invokes find_vma_prepare(
) to locate the object of the memory region that shall
precede the new interval, as well as the position of the new
region in the red-black tree:
for (;;) {
vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
if (!vma || vma->vm_start >= addr + len)
break;
if (do_munmap(mm, addr, len))
return -ENOMEM;
}

The find_vma_prepare( )
function also checks whether a memory region that overlaps the new
interval already exists. This occurs when the function returns a
non-NULL address pointing to a
region that starts before the end of the new interval. In this
case, do_mmap_pgoff( ) invokes
do_munmap( ) to remove the new
interval and then repeats the whole step (see the later section
"Releasing a Linear
Address Interval").
Checks whether inserting the new memory region causes the
size of the process address space (mm->total_vm<<PAGE_SHIFT)+len
to exceed the threshold stored in the signal->rlim[RLIMIT_AS].rlim_cur
field of the process descriptor. If so, it returns the error code
-ENOMEM. Notice that the check
is done here and not in step 1 with the other checks, because some
memory regions could have been removed in step 4.
Returns the error code -ENOMEM if the MAP_NORESERVE flag was not set in the
flags parameter, the new memory
region contains private writable pages, and there are not enough
free page frames; this last check is performed by the security_vm_enough_memory( )
function.
If the new interval is private (VM_SHARED not set) and it does not map a
file on disk, it invokes vma_merge(
) to check whether the preceding memory region can be
expanded in such a way to include the new interval. Of course, the
preceding memory region must have exactly the same flags as those
memory regions stored in the vm_flags local variable. If the
preceding memory region can be expanded, vma_merge( ) also tries to merge it with
the following memory region (this occurs when the new interval
fills the hole between two memory regions and all three have the
same flags). In case it succeeds in expanding the preceding memory
region, the function jumps to step 12.
Allocates a vm_area_struct data structure for the
new memory region by invoking the kmem_cache_alloc( ) slab allocator
function.
Initializes the new memory region object (pointed to by
vma):
vma->vm_mm = mm;
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_flags = vm_flags;
vma->vm_page_prot = protection_map[vm_flags & 0x0f];
vma->vm_ops = NULL;
vma->vm_pgoff = pgoff;
vma->vm_file = NULL;
vma->vm_private_data = NULL;
vma->vm_next = NULL;
INIT_LIST_HEAD(&vma->shared);
If the MAP_SHARED flag is
set (and the new memory region doesn't map a file on disk), the
region is a shared anonymous region: invokes shmem_zero_setup( ) to initialize it.
Shared anonymous regions are mainly used for interprocess
communications; see the section "IPC Shared Memory"
in Chapter 19.
Invokes vma_link( ) to
insert the new region in the memory region list and red-black tree
(see the earlier section "Memory Region
Handling").
Increases the size of the process address space stored in
the total_vm field of the
memory descriptor.
If the VM_LOCKED flag is
set, it invokes make_pages_present(
) to allocate all the pages of the memory region in
succession and lock them in RAM:
if (vm_flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
make_pages_present(addr, addr + len);
}The make_pages_present( )
function, in turn, invokes get_user_pages( ) as follows:
write = (vma->vm_flags & VM_WRITE) != 0;
get_user_pages(current, current->mm, addr, len, write, 0, NULL, NULL);The get_user_pages( )
function cycles through all starting linear addresses of the pages
between addr and addr+len; for each of them, it invokes
follow_page( ) to check whether
there is a mapping to a physical page in the current's Page Tables. If no such
physical page exists, get_user_pages(
) invokes handle_mm_fault(
), which, as we'll see in the section "Handling a Faulty Address
Inside the Address Space," allocates one page frame and
sets its Page Table entry according to the vm_flags field of the memory region
descriptor.
Finally, it terminates by returning the linear address of the new memory region.
When the kernel must delete a linear address interval
from the address space of the current process, it uses the do_munmap( ) function. The parameters are:
the address mm of the process's
memory descriptor, the starting address start of the interval, and its length
len. The interval to be deleted
does not usually correspond to a memory region; it may be included in
one memory region or span two or more regions.
The function goes through two main phases. In the
first phase (steps 1–6), it scans the list of memory regions owned
by the process and unlinks all regions included in the linear
address interval from the process address space. In the second phase
(steps 7–12), the function updates the process Page Tables and
removes the memory regions identified in the first phase. The
function makes use of the split_vma(
) and unmap_region( )
functions, which will be described later. do_munmap( ) executes the following
steps:
Performs some preliminary checks on the parameter values.
If the linear address interval includes addresses greater than
TASK_SIZE, if start is not a multiple of 4,096, or
if the linear address interval has a zero length, the function
returns the error code -EINVAL.
Locates the first memory region mpnt that ends after the linear
address interval to be deleted (mpnt->vm_end > start), if
any:
mpnt = find_vma_prev(mm, start, &prev);
If there is no such memory region, or if the region does not overlap with the linear address interval, nothing has to be done because there is no memory region in the interval:
end = start + len;
if (!mpnt || mpnt->vm_start >= end)
return 0;
If the linear address interval starts inside the mpnt memory region, it invokes
split_vma( ) (described
below) to split the mpnt
memory region into two smaller regions: one outside the interval
and the other inside the interval:
if (start > mpnt->vm_start) {
if (split_vma(mm, mpnt, start, 0))
return -ENOMEM;
prev = mpnt;
}The prev local
variable, which previously stored the pointer to the memory
region preceding mpnt, is
updated so that it points to mpnt—that is, to the new memory region
lying outside the linear address interval. In this way, prev still points to the memory region
preceding the first memory region to be removed.
If the linear address interval ends inside a memory
region, it invokes split_vma(
) once again to split the last overlapping memory
region into two smaller regions: one inside the interval and the
other outside the interval:[*]
last = find_vma(mm, end);
if (last && end > last->vm_start) {
if (split_vma(mm, last, end, 1))
return -ENOMEM;
}
Updates the value of mpnt so that it points to the first
memory region in the linear address interval. If prev is NULL—that is, there is no preceding
memory region—the address of the first memory region is taken
from mm->mmap:
mpnt = prev ? prev->vm_next : mm->mmap;
Invokes detach_vmas_to_be_unmapped( ) to
remove the memory regions included in the linear address
interval from the process's linear address space. This function
essentially executes the following code:
vma = mpnt;
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
do {
rb_erase(&vma->vm_rb, &mm->mm_rb);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
} while (vma && vma->vm_start < end);
*insertion_point = vma;
tail_vma->vm_next = NULL;
mm->mmap_cache = NULL;The descriptors of the regions to be removed are stored in
an ordered list, whose head is pointed to by the mpnt local variable (actually, this
list is just a fragment of the original process's list of memory
regions).
Gets the mm->page_table_lock spin
lock.
Invokes unmap_region( )
to clear the Page Table entries covering the linear address
interval and to free the corresponding page frames (discussed
later):
unmap_region(mm, mpnt, prev, start, end);
Releases the mm->page_table_lock spin
lock.
Releases the descriptors of the memory regions collected in the list built in step 7:
do {
struct vm_area_struct * next = mpnt->vm_next;
unmap_vma(mm, mpnt);
mpnt = next;
} while (mpnt != NULL);The unmap_vma( )
function is invoked on every memory region in the list; it
essentially executes the following steps:
Updates the mm->total_vm and mm->locked_vm fields.
Executes the mm->unmap_area method of the
memory descriptor. This method is implemented either by
arch_unmap_area( ) or by
arch_unmap_area_topdown(
), according to the memory region layout of the
process (see the earlier section "Memory Region
Handling"). In both cases, the mm->free_area_cache field is
updated, if needed.
Invokes the close
method of the memory region, if defined.
If the memory region is anonymous, the function
removes it from the anonymous memory region list headed at
mm->anon_vma.
Invokes kmem_cache_free(
) to release the memory region descriptor.
Returns 0 (success).
The purpose of the split_vma(
) function is to split a memory region that intersects a
linear address interval into two smaller regions, one outside of the
interval and the other inside. The function receives four
parameters: a memory descriptor pointer mm, a memory area descriptor pointer
vma that identifies the region to
be split, an address addr that
specifies the intersection point between the interval and the memory
region, and a flag new_below that
specifies whether the intersection occurs at the beginning or at the
end of the interval. The function performs the following basic
steps:
Invokes kmem_cache_alloc(
) to get an additional vm_area_struct descriptor, and stores
its address in the new local
variable. If no free memory is available, it returns -ENOMEM.
Initializes the fields of the new descriptor with the contents of
the fields of the vma
descriptor.
If the new_below flag
is 0, the linear address interval starts inside the vma region, so the new region must be
placed after the vma region.
Thus, the function sets both the new->vm_start and the vma->vm_end fields to addr.
Conversely, if the new_below flag is equal to 1, the
linear address interval ends inside the vma region, so the new region must be
placed before the vma region.
Thus, the function sets both the new->vm_end and the vma->vm_start fields to addr.
If the open method of
the new memory region is defined, the function executes
it.
Links the new memory
region descriptor to the mm->mmap list of memory regions and
to the mm->mm_rb red-black
tree. Moreover, the function adjusts the red-black tree to take
care of the new size of the memory region vma.
Returns 0 (success).
The unmap_region( )
function walks through a list of memory regions and releases the
page frames belonging to them. It acts on five parameters: a memory
descriptor pointer mm, a pointer
vma to the descriptor of the
first memory region being removed, a pointer prev to the memory region preceding
vma in the process's list (see
steps 2 and 4 in do_munmap()),
and two addresses start and
end that delimit the linear
address interval being removed. The function essentially executes
the following steps:
Invokes lru_add_drain(
) (see Chapter
17).
Invokes the tlb_gather_mmu(
) function to initialize a per-CPU variable named
mmu_gathers. The contents of
mmu_gathers are
architecture-dependent: generally speaking, the variable should
store all information required for a successful updating of the
page table entries of a process. In the 80 × 86 architecture,
the tlb_gather_mmu( )
function simply saves the value of the mm memory descriptor pointer in the
mmu_gathers variable of the
local CPU.
Stores the address of the mmu_gathers variable in the tlb local variable.
Invokes unmap_vmas( )
to scan all Page Table entries belonging to the linear address
interval: if only one CPU is available, the function invokes
free_swap_and_cache( )
repeatedly to release the corresponding pages (see Chapter 17); otherwise, the
function saves the pointers of the corresponding page
descriptors in the mmu_gathers local variable.
Invokes free_pgtables(tlb,prev,start,end) to
try to reclaim the Page Tables of the process that have been
emptied in the previous step.
Invokes tlb_finish_mmu(tlb,start,end) to
finish the work: in turn, this function:
Invokes flush_tlb_mm(
) to flush the TLB (see the section "Handling the Hardware
Cache and the TLB" in Chapter 2).
In multiprocessor systems, invokes free_pages_and_swap_cache( ) to
release the page frames whose pointers have been collected
in the mmu_gather data
structure. This function is described in Chapter 17.
[*] We omitted describing a few additional fields used in NUMA systems.
[*] Removing a linear address interval may theoretically fail because no free memory is available for a new memory descriptor.
[*] You might consider this use of the Page size bit to be a dirty trick,
because the bit was meant to indicate the real page size. But
Linux can get away with the deception because the 80 × 86 chip
checks the Page size bit in
Page Directory entries, but not in Page Table entries.
[*] Actually, the def_flags field of the memory
descriptor is modified only by the mlockall( ) system call, which can
be used to set the VM_LOCKED flag, thus locking all
future pages of the calling process in RAM.
[*] If the linear address interval is properly contained inside a memory region, the region must be replaced by two new smaller regions. When this case occurs, step 4 and step 5 break the memory region into three smaller regions: the middle region is destroyed, while the first and the last ones are preserved.
As stated previously, the Linux Page Fault exception handler must distinguish exceptions caused by programming errors from those caused by a reference to a page that legitimately belongs to the process address space but simply hasn't been allocated yet.
The memory region descriptors allow the exception handler to
perform its job quite efficiently. The do_page_fault( ) function, which is the Page
Fault interrupt service routine for the 80 × 86 architecture, compares
the linear address that caused the Page Fault against the memory regions
of the current process; it can thus
determine the proper way to handle the exception according to the scheme
that is illustrated in Figure
9-4.
In practice, things are a lot more complex because the Page Fault handler must recognize several particular subcases that fit awkwardly into the overall scheme, and it must distinguish several kinds of legal access. A detailed flow diagram of the handler is illustrated in Figure 9-5.
The identifiers vmalloc_fault,
good_area, bad_area, and no_context are labels appearing in do_page_fault( ) that should help you to
relate the blocks of the flow diagram to specific lines of code.
The do_page_fault( ) function
accepts the following input parameters:
The regs address of a
pt_regs structure containing the
values of the microprocessor registers when the exception
occurred.
A 3-bit error_code, which
is pushed on the stack by the control unit when the exception
occurred (see "Hardware
Handling of Interrupts and Exceptions" in Chapter 4). The bits have the
following meanings:
If bit 0 is clear, the exception was caused by an access
to a page that is not present (the Present flag in the Page Table entry
is clear); otherwise, if bit 0 is set, the exception was caused
by an invalid access right.
If bit 1 is clear, the exception was caused by a read or execute access; if set, the exception was caused by a write access.
If bit 2 is clear, the exception occurred while the processor was in Kernel Mode; otherwise, it occurred in User Mode.
The first operation of do_page_fault(
) consists of reading the linear address that caused the Page
Fault. When the exception occurs, the CPU control unit stores that value
in the cr2 control register:
asm("movl %%cr2,%0":"=r" (address));
if (regs->eflags & 0x00020200)
local_irq_enable( );
tsk = current;
The linear address is saved in the address local variable. The function also
ensures that local interrupts are enabled if they were enabled before
the fault or the CPU was running in virtual-8086 mode, and saves the
pointers to the process descriptor of current in the tsk local variable.
As shown at the top of Figure 9-5, do_page_fault( ) checks whether the faulty
linear address belongs to the fourth gigabyte:
info.si_code = SEGV_MAPERR;
if (address >= TASK_SIZE ) {
if (!(error_code & 0x101))
goto vmalloc_fault;
goto bad_area_nosemaphore;
}
If the exception was caused by the kernel trying to access a
nonexisting page frame, a jump is made to the code at label vmalloc_fault, which takes care of faults that
were likely caused by accessing a noncontiguous memory area in Kernel
Mode; we describe this case in the later section "Handling Noncontiguous Memory Area
Accesses." Otherwise, a jump is made to the code at the bad_area_nosemaphore label, described in the
later section "Handling a
Faulty Address Outside the Address Space."
Next, the handler checks whether the exception occurred while the
kernel was executing some critical routine or running a kernel thread
(remember that the mm field of the
process descriptor is always NULL for
kernel threads):
if (in_atomic( ) || !tsk->mm)
goto bad_area_nosemaphore;
The in_atomic( ) macro yields
the value one if the fault occurred while either one of the following
conditions holds:
The kernel was executing an interrupt handler or a deferrable function.
The kernel was executing a critical region with kernel preemption disabled (see the section "Kernel Preemption" in Chapter 5).
If the Page Fault did occur in an interrupt handler, in a
deferrable function, in a critical region, or in a kernel thread,
do_page_fault( ) does not try to
compare the linear address with the memory regions of current. Kernel threads never use linear
addresses below TASK_SIZE. Similarly,
interrupt handlers, deferrable functions, and code of critical regions
should not use linear addresses below TASK_SIZE because this might block the current
process. (See the section "Handling a Faulty Address Outside
the Address Space" later in this chapter for information on the
info local variable and a description
of the code at the bad_area_nosemaphore label.)
Let's suppose that the Page Fault did not occur in an interrupt
handler, in a deferrable function, in a critical region, or in a kernel
thread. Then the function must inspect the memory regions owned by the
process to determine whether the faulty linear address is included in
the process address space. In order to do this, it must acquire the
mmap_sem read/write semaphore of the
process:
if (!down_read_trylock(&tsk->mm->mmap_sem)) {
if ((error_code & 4) == 0 &&
!search_exception_table(regs->eip))
goto bad_area_nosemaphore;
down_read(&tsk->mm->mmap_sem);
}
If kernel bugs and hardware malfunctioning can be ruled out, the
current process has not already acquired the mmap_sem semaphore for writing when the Page
Fault occurs. However, do_page_fault(
) wants to be sure that this is actually true, because
otherwise a deadlock would occur. For that reason, the function makes
use of down_read_trylock( ) instead
of down_read( ) (see the section
"Read/Write
Semaphores" in Chapter
5). If the semaphore is closed and the Page Fault occurred in
Kernel Mode, do_page_fault( )
determines whether the exception occurred while using some linear
address that has been passed to the kernel as a parameter of a system
call (see the next section "Handling a Faulty Address Outside
the Address Space"). In this case, do_page_fault( ) knows for sure that the
semaphore is owned by another process—because every system call service
routine carefully avoids acquiring the mmap_sem semaphore for writing before
accessing the User Mode address space—so the function waits until the
semaphore is released. Otherwise, the Page Fault is due to a kernel bug
or to a serious hardware problem, so the function jumps to the bad_area_nosemaphore label.
Let's assume that the mmap_sem
semaphore has been safely acquired for reading. Now do_page_fault( ) looks for a memory region
containing the faulty linear address:
vma = find_vma(tsk->mm, address);
if (!vma)
goto bad_area;
if (vma->vm_start <= address)
goto good_area;
address,因此错误地址肯定是坏的。另一方面,如果第一个内存区域在address包含之后结束address,则函数跳转到标签处的代码good_area。
If vma is NULL, there is no memory region ending after
address, and thus the faulty address
is certainly bad. On the other hand, if the first memory region ending
after address includes address, the function jumps to the code at
label good_area.
If none of the two "if" conditions are satisfied, the function has
determined that address is not
included in any memory region; however, it must perform an additional
check, because the faulty address may have been caused by a push or pusha instruction on the User Mode stack of
the process.
Let's make a short digression to explain how stacks are mapped
into memory regions. Each region that contains a stack expands toward
lower addresses; its VM_GROWSDOWN
flag is set, so the value of its vm_end field remains fixed while the value of
its vm_start field may be decreased.
The region boundaries include, but do not delimit precisely, the current
size of the User Mode stack. The reasons for the fuzz factor are:
The region size is a multiple of 4 KB (it must include complete pages) while the stack size is arbitrary.
Page frames assigned to a region are never released until the
region is deleted; in particular, the value of the vm_start field of a region that includes a
stack can only decrease; it can never increase. Even if the process
executes a series of pop
instructions, the region size remains unchanged.
It should now be clear how a process that has filled up the last
page frame allocated to its stack may cause a Page Fault exception: the
push refers to an address outside of
the region (and to a nonexistent page frame). Notice that this kind of
exception is not caused by a programming error; thus it must be handled
separately by the Page Fault handler.
We now return to the description of do_page_fault( ), which checks for the case described
previously:
if (!(vma->vm_flags & VM_GROWSDOWN))
goto bad_area;
if (error_code & 4 /* User Mode */
&& address + 32 < regs->esp)
goto bad_area;
if (expand_stack(vma, address))
goto bad_area;
goto good_area;
If the VM_GROWSDOWN flag of the
region is set and the exception occurred in User Mode, the function
checks whether address is smaller
than the regs->esp stack pointer
(it should be only a little smaller). Because a few stack-related
assembly language instructions (such as pusha) perform a decrement of the esp register only after the memory access, a
32-byte tolerance interval is granted to the process. If the address is
high enough (within the tolerance granted), the code invokes the
expand_stack( ) function to check
whether the process is allowed to extend both its stack and its address
space; if everything is OK, it sets the vm_start field of vma to address and returns 0; otherwise, it returns
-ENOMEM.
Note that the preceding code skips the tolerance check whenever
the VM_GROWSDOWN flag of the region
is set and the exception did not occur in User Mode. These conditions
mean that the kernel is addressing the User Mode stack and that the code
should always run expand_stack(
).
If address does not
belong to the process address space, do_page_fault( ) proceeds to execute the
statements at the label bad_area.
If the error occurred in User Mode, it sends a SIGSEGV signal to current (see the section "Generating a Signal" in
Chapter 11) and
terminates:
bad_area:
up_read(&tsk->mm->mmap_sem);
bad_area_nosemaphore:
if (error_code & 4) { /* User Mode */
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code | (address >= TASK_SIZE);
tsk->thread.trap_no = 14;
info.si_signo = SIGSEGV;
info.si_errno = 0;
info.si_addr = (void *) address;
force_sig_info(SIGSEGV, &info, tsk);
return;
}
The force_sig_info( )
function makes sure that the process does not ignore or block the
SIGSEGV signal, and sends the
signal to the User Mode process while passing some additional
information in the info local
variable (see the section "Generating a Signal" in
Chapter 11). The info.si_code field is already set to
SEGV_MAPERR (if the exception was
due to a nonexisting page frame) or to SEGV_ACCERR (if the exception was due to an
invalid access to an existing page frame).
If the exception occurred in Kernel Mode (bit 2 of error_code is clear), there are still two
alternatives:
The exception occurred while using some linear address that has been passed to the kernel as a parameter of a system call.
The exception is due to a real kernel bug.
The function distinguishes these two alternatives as follows:
no_context:
if ((fixup = search_exception_table(regs->eip)) != 0) {
regs->eip = fixup;
return;
}
In the first case, it jumps to a "fixup code," which typically
sends a SIGSEGV signal to current or terminates a system call handler
with a proper error code (see the section "Dynamic Address Checking: The
Fix-up Code" in Chapter
10).
In the second case, the function prints a complete dump of the
CPU registers and of the Kernel Mode stack both on the console and on
a system message buffer; it then kills the current process by invoking
the do_exit( ) function (see Chapter 20). This is the
so-called "Kernel oops" error, named after the
message displayed. The dumped values can be used by kernel hackers to
reconstruct the conditions that triggered the bug, and thus find and
correct it.
If address belongs to
the process address space, do_page_fault(
) proceeds to the statement labeled good_area:
good_area:
info.si_code = SEGV_ACCERR;
write = 0;
if (error_code & 2) { /* write access */
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
write++;
} else /* read access */
if ((error_code & 1) || !(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
If the exception was caused by a write access, the function
checks whether the memory region is writable. If not, it jumps to the
bad_area code; if so, it sets the
write local variable to 1.
If the exception was caused by a read or execute access, the
function checks whether the page is already present in RAM. In this
case, the exception occurred because the process tried to access a
privileged page frame (one whose User/Supervisor flag is clear) in User Mode,
so the function jumps to the bad_area code.[*] If the page is not present, the function also checks
whether the memory region is readable or executable.
If the memory region access rights match the access type that
caused the exception, the handle_mm_fault(
) function is invoked to allocate a new page frame:
survive:
ret = handle_mm_fault(tsk->mm, vma, address, write);
if (ret == VM_FAULT_MINOR || ret == VM_FAULT_MAJOR) {
if (ret == VM_FAULT_MINOR) tsk->min_flt++; else tsk->maj_flt++;
up_read(&tsk->mm->mmap_sem);
return;
}
The handle_mm_fault( )
function returns VM_FAULT_MINOR or
VM_FAULT_MAJOR if it succeeded in
allocating a new page frame for the process. The value VM_FAULT_MINOR indicates that the Page Fault
has been handled without blocking the current process; this kind of
Page Fault is called minor fault. The value
VM_FAULT_MAJOR indicates that the
Page Fault forced the current process to sleep (most likely because
time was spent while filling the page frame assigned to the process
with data read from disk); a Page Fault that blocks the current
process is called a major fault. The function can
also return VM_FAULT_OOM (for not
enough memory) or VM_FAULT_SIGBUS
(for every other error).
If handle_mm_fault( ) returns
the value VM_FAULT_SIGBUS, a
SIGBUS signal is sent to the
process:
if (ret == VM_FAULT_SIGBUS) {
do_sigbus:
up_read(&tsk->mm->mmap_sem);
if (!(error_code & 4)) /* Kernel Mode */
goto no_context;
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
info.si_signo = SIGBUS;
info.si_errno = 0;
info.si_code = BUS_ADRERR;
info.si_addr = (void *) address;
force_sig_info(SIGBUS, &info, tsk);
}
If handle_mm_fault( ) cannot
allocate the new page frame, it returns the value VM_FAULT_OOM; in this case, the kernel
usually kills the current process. However, if current is the init
process, it is just put at the end of the run queue and the scheduler
is invoked; once init resumes its
execution, handle_mm_fault( ) is
executed again:
if (ret == VM_FAULT_OOM) {
out_of_memory:
up_read(&tsk->mm->mmap_sem);
if (tsk->pid != 1) {
if (error_code & 4) /* User Mode */
do_exit(SIGKILL);
goto no_context;
}
yield();
down_read(&tsk->mm->mmap_sem);
goto survive;
}
The handle_mm_fault( )
function acts on four parameters:
mm
A pointer to the memory descriptor of the process that was running on the CPU when the exception occurred
vma
A pointer to the descriptor of the memory region, including the linear address that caused the exception
address
The linear address that caused the exception
write_access
Set to 1 if tsk
attempted to write in address
and to 0 if tsk attempted to
read or execute it
The function starts by checking whether the Page Middle
Directory and the Page Table used to map address exist. Even if address belongs to the process address
space, the corresponding Page Tables might not have been allocated, so
the task of allocating them precedes everything else:
pgd = pgd_offset(mm, address);
spin_lock(&mm->page_table_lock);
pud = pud_alloc(mm, pgd, address);
if (pud) {
pmd = pmd_alloc(mm, pud, address);
if (pmd) {
pte = pte_alloc_map(mm, pmd, address);
if (pte)
return handle_pte_fault(mm, vma, address,
write_access, pte, pmd);
}
}
spin_unlock(&mm->page_table_lock);
return VM_FAULT_OOM;
The pgd local variable
contains the Page Global Directory entry that refers to address; pud_alloc(
) and pmd_alloc( ) are
invoked to allocate, if needed, a new Page Upper Directory and a new
Page Middle Directory, respectively.[*] pte_alloc_map( ) is
then invoked to allocate, if needed, a new Page Table. If both
operations are successful, the pte
local variable points to the Page Table entry that refers to address. The handle_pte_fault( ) function is then invoked
to inspect the Page Table entry corresponding to address and to determine how to allocate a
new page frame for the process:
If the accessed page is not present—that is, if it is not already stored in any page frame—the kernel allocates a new page frame and initializes it properly; this technique is called demand paging .
If the accessed page is present but is marked read-only—i.e., if it is already stored in a page frame—the kernel allocates a new page frame and initializes its contents by copying the old page frame data; this technique is called Copy On Write.
The term demand paging denotes a dynamic memory allocation technique that consists of deferring page frame allocation until the last possible moment—until the process attempts to address a page that is not present in RAM, thus causing a Page Fault exception.
The motivation behind demand paging is that processes do not address all the addresses included in their address space right from the start; in fact, some of these addresses may never be used by the process. Moreover, the program locality principle (see the section "Hardware Cache" in Chapter 2) ensures that, at each stage of program execution, only a small subset of the process pages are really referenced, and therefore the page frames containing the temporarily useless pages can be used by other processes. Demand paging is thus preferable to global allocation (assigning all page frames to the process right from the start and leaving them in memory until program termination), because it increases the average number of free page frames in the system and therefore allows better use of the available free memory. From another viewpoint, it allows the system as a whole to get better throughput with the same amount of RAM.
The price to pay for all these good things is system overhead: each Page Fault exception induced by demand paging must be handled by the kernel, thus wasting CPU cycles. Fortunately, the locality principle ensures that once a process starts working with a group of pages, it sticks with them without addressing other pages for quite a while. Thus, Page Fault exceptions may be considered rare events.
An addressed page may not be present in main memory either because the page was never accessed by the process, or because the corresponding page frame has been reclaimed by the kernel (see Chapter 17).
In both cases, the page fault handler must assign a new page frame to the process. How this page frame is initialized, however, depends on the kind of page and on whether the page was previously accessed by the process. In particular:
Either the page was never accessed by the process and it
does not map a disk file, or the page maps a disk file. The kernel
can recognize these cases because the Page Table entry is filled
with zeros—i.e., the pte_none
macro returns the value 1.
The page belongs to a non-linear disk file mapping (see the
section "Non-Linear
Memory Mappings" in Chapter 16). The kernel can
recognize this case, because the Present flag is cleared and the Dirty flag is set—i.e., the pte_file macro returns the value
1.
The page was already accessed by the process, but its
content is temporarily saved on disk. The kernel can recognize
this case because the Page Table entry is not filled with zeros,
but the Present and Dirty flags are cleared.
Thus, the handle_pte_fault(
) function is able to distinguish the three cases by
inspecting the Page Table entry that refers to address:
entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry))
return do_no_page(mm, vma, address, write_access, pte, pmd);
if (pte_file(entry))
return do_file_page(mm, vma, address, write_access, pte, pmd);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}
We'll examine cases 2 and 3 in Chapter 16 and in Chapter 17, respectively.
In case 1, when the page was never accessed or the page linearly
maps a disk file, the do_no_page( )
function is invoked. There are two ways to load the missing page,
depending on whether the page is mapped to a disk file. The function
determines this by checking the nopage method of the vma memory region object, which points to
the function that loads the missing page from disk into RAM if the
page is mapped to a file. Therefore, the possibilities are:
The vma->vm_ops->nopage field is not
NULL. In this case, the memory
region maps a disk file and the field points to the function that
loads the page. This case is covered in the section "Demand Paging for Memory
Mapping" in Chapter
16 and in the section "IPC Shared Memory"
in Chapter 19.
Either the vma->vm_ops
field or the vma->vm_ops->nopage field is
NULL. In this case, the memory
region does not map a file on disk—i.e., it is an
anonymous mapping. Thus, do_no_page(
) invokes the do_anonymous_page( ) function to get a
new page frame:
if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table, pmd,
write_access, address);
The do_anonymous_page( )
function[*] handles write and read requests separately:
if (write_access) {
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
vma->vm_page_prot)), vma);
lru_cache_add_active(page);
SetPageReferenced(page);
set_pte(page_table, entry);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}
The first execution of the pte_unmap macro releases the temporary
kernel mapping for the high-memory physical address of the Page Table
entry established by pte_offset_map
before invoking the handle_pte_fault(
) function (see Table 2-7 in the section
"Page Table
Handling" in Chapter
2). The following pair of pte_offset_map and pte_unmap macros acquires and releases the
same temporary kernel mapping. The temporary kernel mapping has to be
released before invoking alloc_page(
), because this function might block the current
process.
The function increases the rss field of the memory descriptor to keep
track of the number of page frames allocated to the process. The Page
Table entry is then set to the physical address of the page frame,
which is marked as writable[†] and dirty. The lru_cache_add_active( ) function inserts the
new page frame in the swap-related data structures; we discuss it in
Chapter 17.
Conversely, when handling a read access, the content of the page
is irrelevant because the process is addressing it for the first time.
It is safer to give a page filled with zeros to the process rather
than an old page filled with information written by some other
process. Linux goes one step further in the spirit of demand paging.
There is no need to assign a new page frame filled with zeros to the
process right away, because we might as well give it an existing page
called zero page , thus deferring further page frame allocation. The
zero page is allocated statically during kernel initialization in the
empty_zero_page variable (an array
of 4,096 bytes filled with zeros).
The Page Table entry is thus set with the physical address of the zero page:
entry = pte_wrprotect(mk_pte(virt_to_page(empty_zero_page),
vma->vm_page_prot));
set_pte(page_table, entry);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
Because the page is marked as nonwritable, if the process attempts to write in it, the Copy On Write mechanism is activated. Only then does the process get a page of its own to write in. The mechanism is described in the next section.
First-generation Unix systems implemented process
creation in a rather clumsy way: when a fork(
) system call was issued, the kernel duplicated the whole
parent address space in the literal sense of the word and assigned the
copy to the child process. This activity was quite time consuming
since it required:
Allocating page frames for the Page Tables of the child process
Allocating page frames for the pages of the child process
Initializing the Page Tables of the child process
Copying the pages of the parent process into the corresponding pages of the child process
This way of creating an address space involved many memory accesses, used up many CPU cycles, and completely spoiled the cache contents. Last but not least, it was often pointless because many child processes start their execution by loading a new program, thus discarding entirely the inherited address space (see Chapter 20).
Modern Unix kernels, including Linux, follow a more efficient approach called Copy On Write (COW ). The idea is quite simple: instead of duplicating page frames, they are shared between the parent and the child process. However, as long as they are shared, they cannot be modified. Whenever the parent or the child process attempts to write into a shared page frame, an exception occurs. At this point, the kernel duplicates the page into a new page frame that it marks as writable. The original page frame remains write-protected: when the other process tries to write into it, the kernel checks whether the writing process is the only owner of the page frame; in such a case, it makes the page frame writable for the process.
The _count field of the page
descriptor is used to keep track of the number of processes that are
sharing the corresponding page frame. Whenever a process releases a
page frame or a Copy On Write is executed on it, its _count field is decreased; the page frame is
freed only when _count becomes
-1 (see the section "Page Descriptors" in
Chapter 8).
Let's now describe how Linux implements COW. When handle_pte_fault( ) determines that the
Page Fault exception was caused by an access to a page present in
memory, it executes the following instructions:
if (pte_present(entry)) {
if (write_access) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address, pte, pmd, entry);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
set_pte(pte, entry);
flush_tlb_page(vma, address);
pte_unmap(pte);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}
The handle_pte_fault( )
function is architecture-independent: it considers each possible
violation of the page access rights. However, in the 80 × 86
architecture, if the page is present, the access was for writing and
the page frame is write-protected (see the earlier section "Handling a Faulty Address Inside
the Address Space"). Thus, the do_wp_page( ) function is always
invoked.
The do_wp_page( )
function[*] starts by deriving the page descriptor of the page frame
referenced by the Page Table entry involved in the Page Fault
exception. Next, the function determines whether the page must really
be duplicated. If only one process owns the page, Copy On Write does
not apply, and the process should be free to write the page.
Basically, the function reads the _count field of the page descriptor: if it
is equal to 0 (a single owner), COW must not be done. Actually, the
check is slightly more complicated, because the _count field is also increased when the page
is inserted into the swap cache (see the section "The Swap Cache" in Chapter 17) and when the PG_private flag in the page descriptor is
set. However, when COW is not to be done, the page frame is marked as
writable, so that it does not cause further Page Fault exceptions when
writes are attempted:
set_pte(page_table, maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),vma));
flush_tlb_page(vma, address);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
If the page is shared among several processes by means of COW,
the function copies the content of the old page frame (old_page) into the newly allocated one
(new_page). To avoid race
conditions, get_page( ) is invoked
to increase the usage counter of old_page before starting the copy
operation:
old_page = pte_page(pte);
pte_unmap(page_table);
get_page(old_page);
spin_unlock(&mm->page_table_lock);
if (old_page == virt_to_page(empty_zero_page)) {
new_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
} else {
new_page = alloc_page(GFP_HIGHUSER);
vfrom = kmap_atomic(old_page, KM_USER0);
vto = kmap_atomic(new_page, KM_USER1);
copy_page(vto, vfrom);
kunmap_atomic(vfrom, KM_USER0);
kunmap_atomic(vto, KM_USER1);
}
If the old page is the zero page, the new frame is efficiently
filled with zeros when it is allocated (__GFP_ZERO flag). Otherwise, the page frame content is copied
using the copy_page( ) macro.
Special handling for the zero page is not strictly required, but it
improves the system performance, because it preserves the
microprocessor hardware cache by making fewer address
references.
Because the allocation of a page frame can block the process,
the function checks whether the Page Table entry has been modified
since the beginning of the function (pte and *page_table do not have the same value). In
this case, the new page frame is released, the usage counter of
old_page is decreased (to undo the
increment made previously), and the function terminates.
If everything looks OK, the physical address of the new page frame is finally written into the Page Table entry, and the corresponding TLB register is invalidated:
spin_lock(&mm->page_table_lock);
entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page,
vma->vm_page_prot)),vma);
set_pte(page_table, entry);
flush_tlb_page(vma, address);
lru_cache_add_active(new_page);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
The lru_cache_add_active( )
function inserts the new page frame in the swap-related data
structures; see Chapter 17
for its description.
Finally, do_wp_page( )
decreases the usage counter of old_page twice. The first decrement undoes
the safety increment made before copying the page frame contents; the
second decrement reflects the fact that the current process no longer
owns the page frame.
We have seen in the section "Noncontiguous Memory Area
Management" in Chapter
8 that the kernel is quite lazy in updating the Page Table
entries corresponding to noncontiguous memory areas. In fact, the
vmalloc( ) and vfree( ) functions limit themselves to
updating the master kernel Page Tables (i.e., the Page Global
Directory init_mm.pgd and its child
Page Tables).
However, once the kernel initialization phase ends, the master
kernel Page Tables are not directly used by any process or kernel
thread. Thus, consider the first time that a process in Kernel Mode
accesses a noncontiguous memory area. When translating the linear
address into a physical address, the CPU's memory management unit
encounters a null Page Table entry and raises a Page Fault. However,
the handler recognizes this special case because the exception
occurred in Kernel Mode, and the faulty linear address is greater than
TASK_SIZE. Thus, the do_page_fault( ) handler checks the
corresponding master kernel Page Table entry:
vmalloc_fault:
asm("movl %%cr3,%0":"=r" (pgd_paddr));
pgd = pgd_index(address) + (pgd_t *) _ _va(pgd_paddr);
pgd_k = init_mm.pgd + pgd_index(address);
if (!pgd_present(*pgd_k))
goto no_context;
pud = pud_offset(pgd, address);
pud_k = pud_offset(pgd_k, address);
if (!pud_present(*pud_k))
goto no_context;
pmd = pmd_offset(pud, address);
pmd_k = pmd_offset(pud_k, address);
if (!pmd_present(*pmd_k))
goto no_context;
set_pmd(pmd, *pmd_k);
pte_k = pte_offset_kernel(pmd_k, address);
if (!pte_present(*pte_k))
goto no_context;
return;
The pgd_paddr local variable
is loaded with the physical address of the Page Global Directory of
the current process, which is stored in the cr3 register.[*] The pgd local
variable is then loaded with the linear address corresponding to
pgd_paddr, and the pgd_k local variable is loaded with the
linear address of the master kernel Page Global Directory.
If the master kernel Page Global Directory entry corresponding
to the faulty linear address is null, the function jumps to the code
at the no_context label (see the
earlier section "Handling
a Faulty Address Outside the Address Space"). Otherwise, the
function looks at the master kernel Page Upper Directory entry and at
the master kernel Page Middle Directory entry corresponding to the
faulty linear address. Again, if either one of these entries is null,
a jump is done to the no_context
label. Otherwise, the master entry is copied into the corresponding
entry of the process's Page Middle Directory.[*] Then the whole operation is repeated with the master
Page Table entry.
[*] However, this case should never happen, because the kernel does not assign privileged page frames to the processes.
[*] On 80 × 86 microprocessors, these allocations never occur, because the Page Upper Directories are always included in the Page Global Directory, and the Page Middle Directories are either included in the Page Upper Directory (PAE not enabled) or allocated together with the Page Upper Directory (PAE enabled).
[*] To simplify the description of this function, we skip the statements that deal with reverse mapping, a topic that will be covered in the section "Reverse Mapping" in Chapter 17.
[†] If a debugger attempts to write in a page belonging to a
read-only memory region of the traced process, the kernel does not
set the Read/Write flag. The
maybe_mkwrite( ) function takes
care of this special case.
[*] To simplify the description of this function, we skip the statements that deal with reverse mapping, a topic that will be covered in the section "Reverse Mapping" in Chapter 17.
[*] The kernel doesn't use current->mm->pgd to derive the
address because this fault can occur anytime, even during a
process switch.
[*] You might remember from the section "Paging in Linux" in Chapter 2 that if PAE is enabled then the Page Upper Directory entry cannot be null; otherwise, if PAE is disabled, setting the Page Middle Directory entry implicitly sets the Page Upper Directory entry, too.
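The master-entry fix-up described above can be sketched in user space with plain arrays standing in for the page directories. This is only an illustrative model, not kernel code: `fixup_kernel_entry` and the array names are inventions, and a return value of -1 stands for the jump to the no_context label.

```c
#include <assert.h>
#include <stddef.h>

#define PGD_ENTRIES 8

/* Toy model: both directories are arrays of entry pointers; the
 * process directory simply borrows the master kernel entry. */
void *master_pgd[PGD_ENTRIES];
void *process_pgd[PGD_ENTRIES];

/* Returns 0 on success; -1 models the jump to the no_context label
 * taken when the master entry is null. */
int fixup_kernel_entry(int index)
{
    if (master_pgd[index] == NULL)
        return -1;                          /* no_context */
    process_pgd[index] = master_pgd[index]; /* copy the master entry */
    return 0;
}
```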
Of the six typical cases mentioned earlier in the section
"The Process's Address
Space," in which a process gets new memory regions, the first
one—issuing a fork( ) system
call—requires the creation of a whole new address space for the child
process. Conversely, when a process terminates, the kernel destroys its
address space. In this section, we discuss how these two activities are
performed by Linux.
In the section "The clone( ), fork( ), and
vfork( ) System Calls" in Chapter 3, we mentioned that the
kernel invokes the copy_mm( )
function while creating a new process. This function creates the process
address space by setting up all Page Tables and memory descriptors of
the new process.
Each process usually has its own address space, but lightweight
processes can be created by calling clone(
) with the CLONE_VM flag
set. These processes share the same address space; that is, they are
allowed to address the same set of pages.
Following the COW approach described earlier, traditional processes inherit the address space of their parent: pages stay shared as long as they are only read. When one of the processes attempts to write one of them, however, the page is duplicated; after some time, a forked process usually gets its own address space that is different from that of the parent process. Lightweight processes, on the other hand, use the address space of their parent process. Linux implements them simply by not duplicating address space. Lightweight processes can be created considerably faster than normal processes, and the sharing of pages can also be considered a benefit as long as the parent and children coordinate their accesses carefully.
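The effect of COW is easy to observe from user space. The sketch below (the function name is ours; Linux assumed) forks a child whose write to a shared variable triggers the page copy, so the parent's copy stays unmodified:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* After fork(), parent and child initially share read-only page
 * frames; the child's write below triggers Copy On Write, so the
 * parent still sees its own, unmodified copy of the variable. */
int cow_demo(void)
{
    int value = 1;
    pid_t pid = fork();
    if (pid == 0) {            /* child: the write copies the page */
        value = 2;
        _exit(value);
    }
    int status;
    waitpid(pid, &status, 0);
    /* encode both views: parent's value and child's exit code */
    return value * 100 + WEXITSTATUS(status);
}
```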
If the new process has been created by means of the clone( ) system call and if the CLONE_VM flag of the flag parameter is set, copy_mm( ) gives the clone (tsk) the address space of its parent
(current):
if (clone_flags & CLONE_VM) {
atomic_inc(¤t->mm->mm_users);
spin_unlock_wait(¤t->mm->page_table_lock);
tsk->mm = current->mm;
tsk->active_mm = current->mm;
return 0;
}
Invoking the spin_unlock_wait(
) function ensures that, if the page table spin lock of the
process is held by some other CPU, the page fault handler does not
terminate until that lock is released. In fact, besides protecting the
page tables, this spin lock must forbid the creation of new
lightweight processes sharing the current->mm descriptor.
If the CLONE_VM flag is not
set, copy_mm( ) must create a new
address space (even though no memory is allocated within that address
space until the process requests an address). The function allocates a
new memory descriptor, stores its address in the mm field of the new process descriptor
tsk, and copies the contents of
current->mm into tsk->mm. It then changes a few fields of
the new descriptor:
tsk->mm = kmem_cache_alloc(mm_cachep, SLAB_KERNEL);
memcpy(tsk->mm, current->mm, sizeof(*tsk->mm));
atomic_set(&tsk->mm->mm_users, 1);
atomic_set(&tsk->mm->mm_count, 1);
init_rwsem(&tsk->mm->mmap_sem);
tsk->mm->core_waiters = 0;
tsk->mm->page_table_lock = SPIN_LOCK_UNLOCKED;
tsk->mm->ioctx_list_lock = RW_LOCK_UNLOCKED;
tsk->mm->ioctx_list = NULL;
tsk->mm->default_kioctx = INIT_KIOCTX(tsk->mm->default_kioctx,
*tsk->mm);
tsk->mm->free_area_cache = (TASK_SIZE/3+0xfff)&0xfffff000;
tsk->mm->pgd = pgd_alloc(tsk->mm);
tsk->mm->def_flags = 0;
Remember that the pgd_alloc(
) macro allocates a Page Global Directory for the new
process.
The architecture-dependent init_new_context( ) function is then
invoked: when dealing with 80 × 86 processors, this function checks
whether the current process owns a customized Local Descriptor Table;
if so, init_new_context( ) makes a
copy of the Local Descriptor Table of current and adds it to the address space of
tsk.
Finally, the dup_mmap( )
function is invoked to duplicate both the memory regions and the Page
Tables of the parent process. This function inserts the new memory
descriptor tsk->mm in the global
list of memory descriptors. Then it scans the list of regions owned by
the parent process, starting from the one pointed to by current->mm->mmap. It duplicates each
vm_area_struct memory region
descriptor encountered and inserts the copy in the list of regions and
in the red-black tree owned by the child process.
Right after inserting a new memory region descriptor, dup_mmap( ) invokes copy_page_range( ) to create, if necessary,
the Page Tables needed to map the group of pages included in the
memory region and to initialize the new Page Table entries. In
particular, each page frame corresponding to a private, writable page
(VM_SHARED flag off and VM_MAYWRITE flag on) is marked as read-only
for both the parent and the child, so that it will be handled with the
Copy On Write mechanism.
When a process terminates, the kernel invokes the
exit_mm( ) function to release the
address space owned by that process:
mm_release(tsk, tsk->mm);
if (!(mm = tsk->mm)) /* kernel thread ? */
return;
down_read(&mm->mmap_sem);
The mm_release( ) function
essentially wakes up all processes sleeping in the tsk->vfork_done completion (see the
section "Completions" in Chapter 5). Typically, the
corresponding wait queue is nonempty only if the exiting process was
created by means of the vfork( )
system call (see the section "The clone( ), fork( ), and
vfork( ) System Calls" in Chapter 3).
If the process being terminated is not a kernel thread, the
exit_mm( ) function must release
the memory descriptor and all related data structures. First of all,
it checks whether the mm->core_waiters flag is set: if it is,
then the process is dumping the contents of the memory to a core file.
To avoid corruption in the core file, the function makes use of the
mm->core_done and mm->core_startup_done completions to
serialize the execution of the lightweight processes sharing the same
memory descriptor mm.
Next, the function increases the memory descriptor's main usage
counter, resets the mm field of the
process descriptor, and puts the processor in lazy TLB mode (see
"Handling the Hardware
Cache and the TLB" in Chapter 2):
atomic_inc(&mm->mm_count);
spin_lock(tsk->alloc_lock);
tsk->mm = NULL;
up_read(&mm->map_sem);
enter_lazy_tlb(mm, current);
spin_unlock(tsk->alloc_lock);
mmput(mm);
Finally, the mmput( )
function is invoked to release the Local Descriptor Table, the memory
region descriptors, and the Page Tables. The memory descriptor itself,
however, is not released, because exit_mm(
) has increased the main usage counter. The descriptor will
be released by the finish_task_switch(
) function when the process being terminated is
effectively evicted from the local CPU (see the section "The schedule( ) Function"
in Chapter 7).
Each Unix process owns a specific memory region called the
heap, which is used to satisfy the process's
dynamic memory requests. The start_brk and brk fields of the memory descriptor delimit
the starting and ending addresses, respectively, of that region.
The following APIs can be used by the process to request and release dynamic memory:
malloc(size)
Requests size bytes of
dynamic memory; if the allocation succeeds, it returns the linear
address of the first memory location.
calloc(n,size)
Requests an array consisting of n elements of size size; if the allocation succeeds, it
initializes the array components to 0 and returns the linear
address of the first element.
realloc(ptr,size)
Changes the size of a memory area previously allocated by malloc( ) or calloc( ).
free(addr)
Releases the memory region allocated by malloc( ) or calloc( ) that has an initial address of
addr.
brk(addr)
Modifies the size of the heap directly; the addr parameter specifies the new value
of current->mm->brk, and
the return value is the new ending address of the memory region
(the process must check whether it coincides with the requested
addr value).
sbrk(incr)
Is similar to brk( ), except that the incr parameter specifies the increment or decrement of the heap size in bytes.
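A minimal exercise of the library-level APIs in this list (the helper name is ours):

```c
#include <assert.h>
#include <stdlib.h>

/* Exercises malloc/calloc/realloc/free; the C library services all of
 * them on top of the brk( ) and mmap( ) system calls. */
int heap_api_demo(void)
{
    int *a = malloc(4 * sizeof(int));      /* uninitialized bytes   */
    int *z = calloc(4, sizeof(int));       /* zero-filled array     */
    if (a == NULL || z == NULL)
        return 0;
    int zeroed = (z[0] | z[1] | z[2] | z[3]) == 0;

    a[0] = 7;
    a = realloc(a, 8 * sizeof(int));       /* grow; contents kept   */
    int preserved = (a != NULL && a[0] == 7);

    free(a);
    free(z);
    return zeroed && preserved;
}
```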
The brk( ) function differs
from the other functions listed because it is the only one implemented
as a system call. All the other functions are implemented in the C
library by using brk( ) and mmap( ).[*]
When a process in User Mode invokes the brk( ) system call, the kernel executes the
sys_brk(addr) function. This function
first verifies whether the addr
parameter falls inside the memory region that contains the process code;
if so, it returns immediately because the heap cannot overlap with
the memory region containing the process's code:
mm = current->mm;
down_write(&mm->mmap_sem);
if (addr < mm->end_code) {
out:
up_write(&mm->mmap_sem);
return mm->brk;
}
Because the brk( ) system call
acts on a memory region, it allocates and deallocates whole pages.
Therefore, the function aligns the value of addr to a multiple of PAGE_SIZE and compares the result with the
value of the brk field of the memory
descriptor:
newbrk = (addr + 0xfff) & 0xfffff000;
oldbrk = (mm->brk + 0xfff) & 0xfffff000;
if (oldbrk == newbrk) {
mm->brk = addr;
goto out;
}
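For a 4-KB page size, the rounding applied above can be checked in isolation; `page_align` is our name for it, and the 32-bit mask mirrors the snippet:

```c
/* The same rounding sys_brk( ) applies on the 80x86: align addr up
 * to the next multiple of the 4-KB page size (32-bit addresses). */
unsigned long page_align(unsigned long addr)
{
    return (addr + 0xfff) & 0xfffff000;
}
```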
If the process asked to shrink the heap, sys_brk( ) invokes the do_munmap( ) function to do the job and then
returns:
if (addr <= mm->brk) {
if (!do_munmap(mm, newbrk, oldbrk-newbrk))
mm->brk = addr;
goto out;
}
If the process asked to enlarge the heap, sys_brk( ) first checks whether the process is
allowed to do so. If the process is trying to allocate memory outside
its limit, the function simply returns the original value of mm->brk without allocating more
memory:
rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
if (rlim < RLIM_INFINITY && addr - mm->start_data > rlim)
goto out;
The function then checks whether the enlarged heap would overlap some other memory region belonging to the process and, if so, returns without doing anything:
if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
goto out;
If everything is OK, the do_brk(
) function is invoked. If it returns the oldbrk value, the allocation was successful
and sys_brk( ) returns the value
addr; otherwise, it returns the old
mm->brk value:
if (do_brk(oldbrk, newbrk-oldbrk) == oldbrk)
mm->brk = addr;
goto out;
The do_brk( ) function is
actually a simplified version of do_mmap(
) that handles only anonymous memory regions. Its invocation
might be considered equivalent to:
do_mmap(NULL, oldbrk, newbrk-oldbrk, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_FIXED|MAP_PRIVATE, 0)
do_brk( ) is slightly faster
than do_mmap( ), because it avoids
several checks on the memory region object fields by assuming that the
memory region doesn't map a file on disk.
Operating systems offer processes running in User Mode a set of interfaces to interact with hardware devices such as the CPU, disks, and printers. Putting an extra layer between the application and the hardware has several advantages. First, it makes programming easier by freeing users from studying low-level programming characteristics of hardware devices. Second, it greatly increases system security, because the kernel can check the accuracy of the request at the interface level before attempting to satisfy it. Last but not least, these interfaces make programs more portable, because they can be compiled and executed correctly on every kernel that offers the same set of interfaces.
Unix systems implement most interfaces between User Mode processes and hardware devices by means of system calls issued to the kernel. This chapter examines in detail how Linux implements system calls that User Mode programs issue to the kernel.
Let's start by stressing the difference between an application programmer interface (API) and a system call. The former is a function definition that specifies how to obtain a given service, while the latter is an explicit request to the kernel made via a software interrupt.
Unix systems include several libraries of functions that provide APIs to programmers. Some of the APIs defined by the libc standard C library refer to wrapper routines (routines whose only purpose is to issue a system call). Usually, each system call has a corresponding wrapper routine, which defines the API that application programs should employ.
The converse is not true, by the way—an API does not necessarily
correspond to a specific system call. First of all, the API could offer
its services directly in User Mode. (For something abstract such as math
functions, there may be no reason to make system calls.) Second, a
single API function could make several system calls. Moreover, several
API functions could make the same system call, but wrap extra
functionality around it. For instance, in Linux, the malloc( ) , calloc( ) , and free( )
APIs are implemented in the libc
library. The code in this library keeps track of the allocation and
deallocation requests and uses the brk(
) system call to enlarge or shrink the process heap (see
the section "Managing the
Heap" in Chapter
9).
The POSIX standard refers to APIs and not to system calls. A system can be certified as POSIX-compliant if it offers the proper set of APIs to the application programs, no matter how the corresponding functions are implemented. As a matter of fact, several non-Unix systems have been certified as POSIX-compliant, because they offer all traditional Unix services in User Mode libraries.
From the programmer's point of view, the distinction between an API and a system call is irrelevant—the only things that matter are the function name, the parameter types, and the meaning of the return code. From the kernel designer's point of view, however, the distinction does matter because system calls belong to the kernel, while User Mode libraries don't.
Most wrapper routines return an integer value, whose meaning
depends on the corresponding system call. A return value of -1 usually
indicates that the kernel was unable to satisfy the process request. A
failure in the system call handler may be caused by invalid parameters,
a lack of available resources, hardware problems, and so on. The
specific error code is contained in the errno variable, which is defined in the
libc library.
Each error code is defined as a macro constant, which yields a corresponding positive integer value. The POSIX standard specifies the macro names of several error codes. In Linux, on 80 × 86 systems, these macros are defined in the header file include/asm-i386/errno.h. To allow portability of C programs among Unix systems, the include/asm-i386/errno.h header file is included, in turn, in the standard /usr/include/errno.h C library header file. Other systems have their own specialized subdirectories of header files.
When a User Mode process invokes a system call, the CPU switches to Kernel Mode and starts the execution of a kernel function. As we will see in the next section, in the 80 × 86 architecture a Linux system call can be invoked in two different ways. The net result of both methods, however, is a jump to an assembly language function called the system call handler.
Because the kernel implements many different system calls, the
User Mode process must pass a parameter called the system call
number to identify the required system call; the eax register is used by Linux for this
purpose. As we'll see in the section "Parameter Passing" later in
this chapter, additional parameters are usually passed when invoking a
system call.
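The glibc syscall( ) routine exposes this mechanism directly: its first argument is the system call number that a wrapper would otherwise place in eax. A small sketch (helper name ours):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Invokes getpid by its system call number rather than through the
 * usual getpid( ) wrapper routine. (Linux-specific.) */
long getpid_by_number(void)
{
    return syscall(SYS_getpid);
}
```

Both paths must agree, since they end up in the same service routine.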
All system calls return an integer value. The conventions for
these return values are different from those for wrapper routines. In
the kernel, positive or 0 values denote a successful termination of the
system call, while negative values denote an error condition. In the
latter case, the value is the negation of the error code that must be
returned to the application program in the errno variable. The errno variable is not set or used by the
kernel. Instead, the wrapper routines handle the task of setting this
variable after a return from a system call.
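The convention is easy to verify with a wrapper that must fail; in the sketch below the path is deliberately nonexistent and the helper name is ours. The kernel returns -ENOENT; the wrapper stores ENOENT in errno and hands -1 back to the application:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Attempts to open a file that does not exist; on failure the wrapper
 * routine, not the kernel, sets the errno variable. */
int open_missing_file(void)
{
    errno = 0;
    int fd = open("/no/such/file", O_RDONLY);
    if (fd == -1)
        return errno;          /* set by the wrapper routine */
    close(fd);
    return 0;
}
```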
The system call handler, which has a structure similar to that of the other exception handlers, performs the following operations:
Saves the contents of most registers in the Kernel Mode stack (this operation is common to all system calls and is coded in assembly language).
Handles the system call by invoking a corresponding C function called the system call service routine.
Exits from the handler: the registers are loaded with the values saved in the Kernel Mode stack, and the CPU is switched back from Kernel Mode to User Mode (this operation is common to all system calls and is coded in assembly language).
The name of the service routine associated with the
xyz ( )
system call is usually sys_
xyz ( );
there are, however, a few exceptions to this rule.
Figure 10-1
illustrates the relationships between the application program that
invokes a system call, the corresponding wrapper routine, the system
call handler, and the system call service routine. The arrows denote the
execution flow between the functions. The terms "SYSCALL" and "SYSEXIT" are placeholders for the actual
assembly language instructions that switch the CPU, respectively, from
User Mode to Kernel Mode and from Kernel Mode to User Mode.
To associate each system call number with its corresponding
service routine, the kernel uses a system call dispatch
table, which is stored in the sys_call_table array and has NR_syscalls entries (289 in the Linux 2.6.11
kernel). The n th entry
contains the service routine address of the system call having number
n.
The NR_syscalls macro is just a
static limit on the maximum number of implementable system calls; it
does not indicate the number of system calls actually implemented.
Indeed, each entry of the dispatch table may contain the address of the
sys_ni_syscall( ) function, which is
the service routine of the "nonimplemented" system calls; it just
returns the error code -ENOSYS.
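This behavior can be observed from user space by requesting an out-of-range system call number through the glibc syscall( ) routine (helper name ours; the number is arbitrary):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* An out-of-range system call number fails the dispatch-range check
 * (or reaches sys_ni_syscall( )); either way the wrapper reports
 * ENOSYS to the caller. */
int bogus_syscall_errno(void)
{
    errno = 0;
    long ret = syscall(100000000L);   /* no such system call number */
    return (ret == -1) ? errno : 0;
}
```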
Native applications[*] can invoke a system call in two different ways:
By executing the int $0x80 assembly language instruction; in
older versions of the Linux kernel, this was the only way to switch
from User Mode to Kernel Mode.
By executing the sysenter
assembly language instruction, introduced in the
Intel Pentium II microprocessors; this instruction is now supported
by the Linux 2.6 kernel.
Similarly, the kernel can exit from a system call—thus switching the CPU back to User Mode—in two ways:
However, supporting two different ways to enter the kernel is not as simple as it might look, because:
The kernel must support both older libraries that only use the
int $0x80 instruction and more
recent ones that also use the sysenter instruction.
A standard library that makes use of the sysenter instruction must be able to cope
with older kernels that support only the int $0x80 instruction.
The kernel and the standard library must be able to run both
on older processors that do not include the sysenter instruction and on more recent
ones that include it.
We will see in the section "Issuing a System Call via the sysenter Instruction" later in this chapter how the Linux kernel solves these compatibility problems.
The "traditional" way to invoke a system call makes use of the
int assembly language instruction,
which was discussed in the section "Hardware Handling of Interrupts
and Exceptions" in Chapter
4.
The vector 128—in hexadecimal, 0x80—is associated with the kernel entry
point. The trap_init( ) function,
invoked during kernel initialization, sets up the Interrupt Descriptor
Table entry corresponding to vector 128 as follows:
set_system_gate(0x80, &system_call);
The call loads the following values into the gate descriptor fields (see the section "Interrupt, Trap, and System Gates" in Chapter 4):
The __KERNEL_CS
Segment Selector of the kernel code segment.
The pointer to the system_call(
) system call handler.
Set to 15. Indicates that the exception is a Trap and that the corresponding handler does not disable maskable interrupts.
Set to 3. This allows processes in User Mode to invoke the exception handler (see the section "Hardware Handling of Interrupts and Exceptions" in Chapter 4).
Therefore, when a User Mode process issues an int $0x80 instruction, the CPU switches into
Kernel Mode and starts executing instructions from the system_call address.
The system_call( ) function
starts by saving the system call number and all the CPU registers
that may be used by the exception handler on the stack—except for
eflags, cs, eip, ss, and esp, which have already been saved
automatically by the control unit (see the section "Hardware Handling of
Interrupts and Exceptions" in Chapter 4). The SAVE_ALL macro, which was already
discussed in the section "I/O Interrupt Handling"
in Chapter 4, also loads
the Segment Selector of the kernel data segment in ds and es:
system_call:
pushl %eax
SAVE_ALL
movl $0xffffe000, %ebx /* or 0xfffff000 for 4-KB stacks */
andl %esp, %ebx
The function then stores the address of the thread_info data structure of the current
process in ebx (see the section
"Identifying a
Process" in Chapter
3). This is done by taking the value of the kernel stack
pointer and rounding it up to a multiple of 4 or 8 KB (see the
section "Identifying a
Process" in Chapter
3).
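The masking step performed by the andl instruction can be reproduced in isolation; `thread_info_base` is our name for it:

```c
/* Reproduces the masking done by system_call( ): with 8-KB kernel
 * stacks, clearing the low 13 bits of the (32-bit) stack pointer
 * yields the base of the thread_info structure. */
unsigned long thread_info_base(unsigned long esp)
{
    return esp & 0xffffe000;   /* 0xfffff000 for 4-KB stacks */
}
```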
Next, the system_call( )
function checks whether either one of the TIF_SYSCALL_TRACE and TIF_SYSCALL_AUDIT flags included in the
flags field of the thread_info structure is set—that is,
whether the system call invocations of the executed program are
being traced by a debugger. If this is the case, system_call( ) invokes the do_syscall_trace( ) function twice: once
right before and once right after the execution of the system call
service routine (as described later). This function stops current and thus allows the debugging
process to collect information about it.
A validity check is then performed on the system call number passed by the User Mode process. If it is greater than or equal to the number of entries in the system call dispatch table, the system call handler terminates:
cmpl $NR_syscalls, %eax
jb nobadsys
movl $(-ENOSYS), 24(%esp)
jmp resume_userspace
nobadsys:
If the system call number is not valid, the function stores
the -ENOSYS value in the stack
location where the eax register
has been saved—that is, at offset 24 from the current stack top. It
then jumps to resume_userspace
(see below). In this way, when the process resumes its execution in
User Mode, it will find a negative return code in eax.
Finally, the specific service routine associated with the
system call number contained in eax is invoked:
call *sys_call_table(0, %eax, 4)
Because each entry in the dispatch table is 4 bytes long, the
kernel finds the address of the service routine to be invoked by
multiplying the system call number by 4, adding the initial address
of the sys_call_table dispatch
table, and extracting a pointer to the service routine from that
slot in the table.
When the system call service routine terminates, the
system_call( ) function gets its
return code from eax and stores
it in the stack location where the User Mode value of the eax register is saved:
movl %eax, 24(%esp)
Thus, the User Mode process will find the return code of the
system call in the eax
register.
Then, the system_call( )
function disables the local interrupts and checks the flags in the
thread_info structure of current:
cli
movl 8(%ebp), %ecx
testw $0xffff, %cx
je restore_all
The flags field is at
offset 8 in the thread_info
structure; the mask 0xffff
selects the bits corresponding to all flags listed in Table 4-15 except
TIF_POLLING_NRFLAG. If none of
these flags is set, the function jumps to the restore_all label: as described in the
section "Returning from
Interrupts and Exceptions" in Chapter 4, this code restores
the contents of the registers saved on the Kernel Mode stack and
executes an iret assembly language instruction to resume the User Mode
process. (You might refer to the flow diagram in Figure 4-6.)
If any of the flags is set, then there is some work to be done
before returning to User Mode. If the TIF_SYSCALL_TRACE flag is set, the
system_call( ) function invokes
for the second time the do_syscall_trace(
) function, then jumps to the resume_userspace label. Otherwise, if the
TIF_SYSCALL_TRACE flag is not
set, the function jumps to the work_pending label.
As explained in the section "Returning from Interrupts and
Exceptions" in Chapter
4, the code at the resume_userspace and work_pending labels checks for
rescheduling requests, virtual-8086 mode, pending signals, and
single stepping; then eventually a jump is done to the restore_all label to resume the execution
of the User Mode process.
The int assembly
language instruction is inherently slow because it performs several
consistency and security checks. (The instruction is described in
detail in the section "Hardware Handling of Interrupts
and Exceptions" in Chapter
4.)
The sysenter instruction,
dubbed in Intel documentation as "Fast System Call," provides a faster
way to switch from User Mode to Kernel Mode.
The sysenter
assembly language instruction makes use of three special registers
that must be loaded with the following information:[*]
SYSENTER_CS_MSR
The Segment Selector of the kernel code segment

SYSENTER_EIP_MSR
The linear address of the kernel entry point

SYSENTER_ESP_MSR
The kernel stack pointer
When the sysenter
instruction is executed, the CPU control unit:
1. Copies the content of SYSENTER_CS_MSR into cs.

2. Copies the content of SYSENTER_EIP_MSR into eip.

3. Copies the content of SYSENTER_ESP_MSR into esp.

4. Adds 8 to the value of SYSENTER_CS_MSR, and loads this value into ss.
Therefore, the CPU switches to Kernel Mode and starts
executing the first instruction of the kernel entry point. As we
have seen in the section "The Linux GDT" in Chapter 2, the kernel stack
segment coincides with the kernel data segment, and the
corresponding descriptor follows the descriptor of the kernel code
segment in the Global Descriptor Table; therefore, step 4 loads the
proper Segment Selector in the ss
register.
The three model-specific registers are initialized by the
enable_sep_cpu( ) function, which
is executed once by every CPU in the system during the
initialization of the kernel. The function performs the following
steps:
1. Writes the Segment Selector of the kernel code (_ _KERNEL_CS) in the SYSENTER_CS_MSR register.

2. Writes in the SYSENTER_EIP_MSR register the linear address of the sysenter_entry( ) function described below.

3. Computes the linear address of the end of the local TSS, and writes this value in the SYSENTER_ESP_MSR register.[*]
The setting of the SYSENTER_ESP_MSR register deserves some
comments. When a system call starts, the kernel stack is empty, thus
the esp register should point to
the end of the 4- or 8-KB memory area that includes the kernel stack
and the descriptor of the current process (see Figure 3-2). The User Mode
wrapper routine cannot properly set this register, because it does
not know the address of this memory area; on the other hand, the
value of the register must be set before switching to Kernel Mode.
Therefore, the kernel initializes the register so as to encode the
address of the Task State Segment of the local CPU. As we have
described in step 3 of the _ _switch_to(
) function (see the section "Performing the Process
Switch" in Chapter
3), at every process switch the kernel saves the kernel stack
pointer of the current process in the esp0 field of the local TSS. Thus, the
system call handler reads the esp
register, computes the address of the esp0 field of the local TSS, and loads
into the same esp register the
proper kernel stack pointer.
A wrapper function in the libc
standard library can make use of the sysenter instruction only if both the CPU
and the Linux kernel support it.
This compatibility problem calls for a quite sophisticated
solution. Essentially, in the initialization phase the sysenter_setup( ) function builds a page
frame called vsyscall page containing a small ELF shared object (i.e., a tiny
ELF dynamic library). When a process issues an execve( ) system call to start executing an ELF program, the
code in the vsyscall page is dynamically linked to the process
address space (see the section "The exec Functions" in
Chapter 20). The code in
the vsyscall page makes use of the best available instruction to
issue a system call.
The sysenter_setup( )
function allocates a new page frame for the vsyscall page and
associates its physical address with the FIX_VSYSCALL fix-mapped linear address
(see the section "Fix-Mapped Linear
Addresses" in Chapter
2). Then, the function copies in the page either one of two
predefined ELF shared objects:
If the CPU does not support sysenter, the function builds a vsyscall page that includes the code:

_ _kernel_vsyscall:
int $0x80
ret

Otherwise, if the CPU does support sysenter, the function builds a vsyscall page that includes the code:

_ _kernel_vsyscall:
pushl %ecx
pushl %edx
pushl %ebp
movl %esp, %ebp
sysenter
When a wrapper routine in the standard library must invoke a
system call, it calls the _
_kernel_vsyscall( ) function, whatever it may be.
A final compatibility problem is due to old versions of the
Linux kernel that do not support the sysenter instruction; in this case, of
course, the kernel does not build the vsyscall page and the _ _kernel_vsyscall( ) function is not
linked to the address space of the User Mode processes. When recent
standard libraries recognize this fact, they simply execute the
int $0x80 instruction to invoke
the system calls.
The sequence of steps performed when a system call is
issued via the sysenter
instruction is the following:
1. The wrapper routine in the standard library loads the system call number into the eax register and calls the _ _kernel_vsyscall( ) function.
2. The _ _kernel_vsyscall( ) function saves on the User Mode stack the contents of ebp, edx, and ecx (these registers are going to be used by the system call handler), copies the user stack pointer in ebp, then executes the sysenter instruction.
3. The CPU switches from User Mode to Kernel Mode, and the kernel starts executing the sysenter_entry( ) function (pointed to by the SYSENTER_EIP_MSR register).
4. The sysenter_entry( ) assembly language function performs the following steps:

a. Sets up the kernel stack pointer:

movl -508(%esp), %esp

Initially, the esp register points to the first location after the local TSS, which is 512 bytes long. Therefore, the instruction loads in the esp register the contents of the field at offset 4 in the local TSS, that is, the contents of the esp0 field. As already explained, the esp0 field always stores the kernel stack pointer of the current process.

b. Enables local interrupts:

sti

c. Saves in the Kernel Mode stack the Segment Selector of the user data segment, the current user stack pointer, the eflags register, the Segment Selector of the user code segment, and the address of the instruction to be executed when exiting from the system call:

pushl $(__USER_DS)
pushl %ebp
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN

Observe that these instructions emulate some operations performed by the int assembly language instruction (steps 5c and 7 in the description of int in the section "Hardware Handling of Interrupts and Exceptions" in Chapter 4).

d. Restores in ebp the original value of the register passed by the wrapper routine:

movl (%ebp), %ebp

This instruction does the job, because _ _kernel_vsyscall( ) saved on the User Mode stack the original value of ebp and then loaded in ebp the current value of the user stack pointer.

e. Invokes the system call handler by executing a sequence of instructions identical to that starting at the system_call label described in the earlier section "Issuing a System Call via the int $0x80 Instruction."
When the system call service routine terminates, the
sysenter_entry( ) function
executes essentially the same operations as the system_call( ) function (see previous
section). First, it gets the return code of the system call service
routine from eax and stores it in
the kernel stack location where the User Mode value of the eax register is saved. Then, the function
disables the local interrupts and checks the flags in the thread_info structure of current.
If any of the flags is set, then there is some work to be done
before returning to User Mode. In order to avoid code duplication,
this case is handled exactly as in the system_call( ) function, thus the function
jumps to the resume_userspace or
work_pending labels (see flow
diagram in Figure
4-6 in Chapter 4).
Eventually, the iret assembly language instruction fetches from the Kernel
Mode stack the five arguments saved in step 4c by the sysenter_entry( ) function, and thus
switches the CPU back to User Mode and starts executing the code at
the SYSENTER_RETURN label (see
below).
If the sysenter_entry( )
function determines that the flags are cleared, it performs a quick
return to User Mode:
movl 40(%esp), %edx
movl 52(%esp), %ecx
xorl %ebp, %ebp
sti
sysexit
The edx and ecx registers are loaded with a couple of
the stack values saved by sysenter_entry(
) in step 4c in the previous section: edx gets the address of the SYSENTER_RETURN label, while ecx gets the current user data stack
pointer.
The sysexit assembly
language instruction is the companion of sysenter: it allows a fast switch from
Kernel Mode to User Mode. When the instruction is executed, the CPU
control unit performs the following steps:
1. Adds 16 to the value in the SYSENTER_CS_MSR register, and loads the result in the cs register.
2. Copies the content of the edx register into the eip register.
3. Adds 24 to the value in the SYSENTER_CS_MSR register, and loads the result in the ss register.
4. Copies the content of the ecx register into the esp register.
Because the SYSENTER_CS_MSR
register is loaded with the Segment Selector of the kernel code, the
cs register is loaded with the
Segment Selector of the user code, while the ss register is loaded with the Segment
Selector of the user data segment (see the section "The Linux GDT" in Chapter 2).
As a result, the CPU switches from Kernel Mode to User Mode
and starts executing the instruction whose address is stored in the
edx register.
The code at the SYSENTER_RETURN label is stored in the
vsyscall page, and it is executed when a system call entered via
sysenter is being terminated,
either by the iret instruction or
the sysexit instruction.
The code simply restores the original contents of the ebp, edx, and ecx registers saved in the User Mode
stack, and returns the control to the wrapper routine in the
standard library:
SYSENTER_RETURN:
popl %ebp
popl %edx
popl %ecx
ret
[*] As we will see in the section "Execution Domains" in Chapter 20, Linux can execute programs compiled for "foreign" operating systems. Therefore, the kernel offers a compatibility mode to enter a system call: User Mode processes executing iBCS and Solaris /x86 programs can enter the kernel by jumping into suitable call gates included in the default Local Descriptor Table (see the section "The Linux LDTs" in Chapter 2).
[*] "MSR" is an acronym for "Model-Specific Register" and denotes a register that is present only in some models of 80 × 86 microprocessors.
[*] The encoding of the local TSS address written in
SYSENTER_ESP_MSR is due
to the fact that the register should point to a real stack,
which grows toward lower addresses. In practice, initializing
the register with any value would work, provided that it is
possible to get the address of the local TSS from such a
value.
Like ordinary functions, system calls often require some input/output parameters, which may consist of actual values (i.e., numbers), addresses of variables in the address space of the User Mode process, or even addresses of data structures including pointers to User Mode functions (see the section "System Calls Related to Signal Handling" in Chapter 11).
Because the system_call( ) and
the sysenter_entry( ) functions are
the common entry points for all system calls in Linux, each of them has
at least one parameter: the system call number passed in the eax register. For instance, if an application
program invokes the fork( )
wrapper routine, the eax register is set to 2 (i.e., _ _NR_fork) before executing the int $0x80 or sysenter assembly language instruction.
Because the register is set by the wrapper routines included in the
libc library, programmers do not usually care about
the system call number.
The fork( ) system call does
not require other parameters. However, many system calls do require
additional parameters, which must be explicitly passed by the
application program. For instance, the mmap(
) system call may require up to six additional parameters
(besides the system call number).
The parameters of ordinary C functions are usually passed by writing their values in the active program stack (either the User Mode stack or the Kernel Mode stack). Because system calls are a special kind of function that crosses over from user to kernel land, neither the User Mode nor the Kernel Mode stack can be used. Rather, system call parameters are written in the CPU registers before issuing the system call. The kernel then copies the parameters stored in the CPU registers onto the Kernel Mode stack before invoking the system call service routine, because the latter is an ordinary C function.
Why doesn't the kernel copy parameters directly from the User Mode stack to the Kernel Mode stack? First of all, working with two stacks at the same time is complex; second, the use of registers makes the structure of the system call handler similar to that of other exception handlers.
However, to pass parameters in registers, two conditions must be satisfied:
The length of each parameter cannot exceed the length of a register (32 bits).[*]
The number of parameters must not exceed six, besides the
system call number passed in eax,
because 80 × 86 processors have a very limited number of
registers.
The first condition is always true because, according to the POSIX
standard, large parameters that cannot be stored in a 32-bit register
must be passed by reference. A typical example is the settimeofday( ) system call, which must read a
64-bit structure.
However, system calls that require more than six parameters exist. In such cases, a single register is used to point to a memory area in the process address space that contains the parameter values. Of course, programmers do not have to care about this workaround. As with every C function call, parameters are automatically saved on the stack when the wrapper routine is invoked. This routine will find the appropriate way to pass the parameters to the kernel.
The registers used to store the system call number and its
parameters are, in increasing order, eax (for the system call number), ebx, ecx,
edx, esi, edi,
and ebp. As seen before, system_call( ) and sysenter_entry( ) save the values of these
registers on the Kernel Mode stack by using the SAVE_ALL macro. Therefore, when the system
call service routine goes to the stack, it finds the return address to
system_call( ) or to sysenter_entry( ), followed by the parameter
stored in ebx (the first parameter of
the system call), the parameter stored in ecx, and so on (see the section "Saving the registers for the
interrupt handler" in Chapter
4). This stack configuration is exactly the same as in an
ordinary function call, and therefore the service routine can easily
refer to its parameters by using the usual C-language constructs.
Let's look at an example. The sys_write(
) service routine, which handles the write( ) system call, is declared as:
int sys_write (unsigned int fd, const char * buf, unsigned int count)
The C compiler produces an assembly language function that expects
to find the fd, buf, and count parameters on top of the stack, right
below the return address, in the locations used to save the contents of
the ebx, ecx, and edx registers, respectively.
In a few cases, even if the system call doesn't use any
parameters, the corresponding service routine needs to know the contents
of the CPU registers right before the system call was issued. For
example, the do_fork( ) function that
implements fork( ) needs to know the
value of the registers in order to duplicate them in the child process
thread field (see the section "The thread field" in
Chapter 3). In these cases, a
single parameter of type pt_regs
allows the service routine to access the values saved in the Kernel Mode
stack by the SAVE_ALL macro (see the
section "The do_IRQ( )
function" in Chapter
4):
int sys_fork (struct pt_regs regs)
The return value of a service routine must be written into the eax register. This is automatically done by the C compiler when a return n; instruction is executed.
All system call parameters must be carefully checked
before the kernel attempts to satisfy a user request. The type of
check depends both on the system call and on the specific parameter.
Let's go back to the write( )
system call introduced before: the fd parameter should be a file descriptor
that identifies a specific file, so sys_write( ) must check whether fd really is a file descriptor of a file
previously opened and whether the process is allowed to write into it
(see the section "File-Handling System
Calls" in Chapter 1).
If any of these conditions are not true, the handler must return a
negative value—in this case, the error code -EBADF.
One type of checking, however, is common to all system calls. Whenever a parameter specifies an address, the kernel must check whether it is inside the process address space. There are two possible ways to perform this check:
Verify that the linear address belongs to the process address space and, if so, that the memory region including it has the proper access rights.
Verify just that the linear address is lower than PAGE_OFFSET (i.e., that it doesn't fall
within the range of interval addresses reserved to the
kernel).
Early Linux kernels performed the first type of checking. But it is quite time consuming because it must be executed for each address parameter included in a system call; furthermore, it is usually pointless because faulty programs are not very common.
Therefore, starting with Version 2.2, Linux employs the second
type of checking. This is much more efficient because it does not
require any scan of the process memory region descriptors. Obviously,
this is a very coarse check: verifying that the linear address is
smaller than PAGE_OFFSET is a
necessary but not sufficient condition for its validity. But there's
no risk in confining the kernel to this limited kind of check because
other errors will be caught later.
The approach followed is thus to defer the real checking until the last possible moment—that is, until the Paging Unit translates the linear address into a physical one. We will discuss in the section "Dynamic Address Checking: The Fix-up Code," later in this chapter, how the Page Fault exception handler succeeds in detecting those bad addresses issued in Kernel Mode that were passed as parameters by User Mode processes.
One might wonder at this point why the coarse check is performed
at all. This type of checking is actually crucial to preserve both
process address spaces and the kernel address space from illegal
accesses. We saw in Chapter
2 that the RAM is mapped starting from PAGE_OFFSET. This means that kernel routines
are able to address all pages present in memory. Thus, if the coarse
check were not performed, a User Mode process might pass an address
belonging to the kernel address space as a parameter and then be able
to read or write every page present in memory without causing a Page
Fault exception.
The check on addresses passed to system calls is performed by
the access_ok( ) macro, which acts
on two parameters: addr and
size. The macro checks the address
interval delimited by addr and
addr + size - 1. It is essentially
equivalent to the following C function:
int access_ok(const void * addr, unsigned long size)
{
    unsigned long a = (unsigned long) addr;
    if (a + size < a ||
        a + size > current_thread_info( )->addr_limit.seg)
        return 0;
    return 1;
}
The function first verifies whether addr + size, the highest address to be
checked, is larger than 2^32 - 1; because
unsigned long integers and pointers are represented by the GNU C
compiler (gcc) as 32-bit numbers,
this is equivalent to checking for an overflow condition. The function
also checks whether addr + size
exceeds the value stored in the addr_limit.seg field of the thread_info structure of current. This field usually has the value
PAGE_OFFSET for normal processes
and the value 0xffffffff for kernel
threads . The value of the addr_limit.seg field can be dynamically
changed by the get_fs and set_fs macros; this allows the kernel to
bypass the security checks made by access_ok(
), so that it can invoke system call service routines,
directly passing to them addresses in the kernel data segment.
The verify_area( ) function
performs the same check as the access_ok(
) macro; although this function is considered obsolete, it
is still widely used in the source code.
System call service routines often need to read or write
data contained in the process's address space. Linux includes a set of
macros that make this access easier. We'll describe two of them,
called get_user( ) and put_user( ). The first can be used to read
1, 2, or 4 consecutive bytes from an address, while the second can be
used to write data of those sizes into an address.
Each function accepts two arguments, a value x to transfer and a variable ptr. The second variable also determines how
many bytes to transfer. Thus, in get_user(x,ptr), the size of the variable
pointed to by ptr causes the
function to expand into a _ _get_user_1(
), _ _get_user_2( ), or
_ _get_user_4( ) assembly language
function. Let's consider one of them, _
_get_user_2( ):
_ _get_user_2:
addl $1, %eax
jc bad_get_user
movl $0xffffe000, %edx /* or 0xfffff000 for 4-KB stacks */
andl %esp, %edx
cmpl 24(%edx), %eax
jae bad_get_user
2: movzwl -1(%eax), %edx
xorl %eax, %eax
ret
bad_get_user:
xorl %edx, %edx
movl $-EFAULT, %eax
ret
The eax register contains the
address ptr of the first byte to be
read. The first six instructions essentially perform the same checks
as the access_ok( ) macro: they
ensure that the 2 bytes to be read have addresses less than 4 GB as
well as less than the addr_limit.seg field of the current process. (This field is stored at
offset 24 in the thread_info
structure of current, which appears
in the first operand of the cmpl
instruction.)
If the addresses are valid, the function executes the movzwl instruction to store the data to be read in the two least significant bytes of the edx register while setting the high-order bytes of edx to 0; then it sets a 0 return code in eax and terminates. If the addresses are not valid, the function clears edx, sets the -EFAULT value into eax, and terminates.
The put_user(x,ptr) macro is
similar to the one discussed before, except it writes the value
x into the process address space
starting from address ptr.
Depending on the size of x, it
invokes either the _ _put_user_asm(
) macro (size of 1, 2, or 4 bytes) or the _ _put_user_u64( ) macro (size of 8 bytes).
Both macros return the value 0 in the eax register if they succeed in writing the
value, and -EFAULT
otherwise.
Several other functions and macros are available to access the process address space in Kernel Mode; they are listed in Table 10-1. Notice that many of them also have a variant prefixed by two underscores (_ _). The ones without initial underscores take extra time to check the validity of the linear address interval requested, while the ones with the underscores bypass that check. Whenever the kernel must repeatedly access the same memory area in the process address space, it is more efficient to check the address once at the start and then access the process area without making any further checks.
Table 10-1. Functions and macros that access the process address space
| Function | Action |
|---|---|
| get_user( ), _ _get_user( ) | Reads an integer value from user space (1, 2, or 4 bytes) |
| put_user( ), _ _put_user( ) | Writes an integer value to user space (1, 2, or 4 bytes) |
| copy_from_user( ), _ _copy_from_user( ) | Copies a block of arbitrary size from user space |
| copy_to_user( ), _ _copy_to_user( ) | Copies a block of arbitrary size to user space |
| strncpy_from_user( ), _ _strncpy_from_user( ) | Copies a null-terminated string from user space |
| strlen_user( ), strnlen_user( ) | Returns the length of a null-terminated string in user space |
| clear_user( ), _ _clear_user( ) | Fills a memory area in user space with zeros |
As seen previously, access_ok(
) makes a coarse check on the validity of linear addresses
passed as parameters of a system call. This check only ensures that
the User Mode process is not attempting to fiddle with the kernel
address space; however, the linear addresses passed as parameters
still might not belong to the process address space. In this case, a
Page Fault exception will occur when the kernel tries to use any
of such bad addresses.
Before describing how the kernel detects this type of error, let's specify the three cases in which Page Fault exceptions may occur in Kernel Mode. These cases must be distinguished by the Page Fault handler, because the actions to be taken are quite different.
The kernel attempts to address a page belonging to the process address space, but either the corresponding page frame does not exist or the kernel tries to write a read-only page. In these cases, the handler must allocate and initialize a new page frame (see the sections "Demand Paging" and "Copy On Write" in Chapter 9).
The kernel addresses a page belonging to its address space, but the corresponding Page Table entry has not yet been initialized (see the section "Handling Noncontiguous Memory Area Accesses" in Chapter 9). In this case, the kernel must properly set up some entries in the Page Tables of the current process.
Some kernel functions include a programming bug that causes the exception to be raised when that program is executed; alternatively, the exception might be caused by a transient hardware error. When this occurs, the handler must perform a kernel oops (see the section "Handling a Faulty Address Inside the Address Space" in Chapter 9).
The case introduced in this chapter: a system call service routine attempts to read or write into a memory area whose address has been passed as a system call parameter, but that address does not belong to the process address space.
The Page Fault handler can easily recognize the first case by determining whether the faulty linear address is included in one of the memory regions owned by the process. It is also able to detect the second case by checking whether the corresponding master kernel Page Table entry includes a proper non-null entry that maps the address. Let's now explain how the handler distinguishes the remaining two cases.
The key to determining the source of a Page Fault lies in the narrow range of calls that the kernel uses to access the process address space. Only the small group of functions and macros described in the previous section are used to access this address space; thus, if the exception is caused by an invalid parameter, the instruction that caused it must be included in one of the functions or else be generated by expanding one of the macros. The number of instructions that address user space is fairly small.
Therefore, it does not take much effort to put the address of
each kernel instruction that accesses the process address space into a
structure called the exception table. If we
succeed in doing this, the rest is easy. When a Page Fault exception
occurs in Kernel Mode, the do_ page_fault(
) handler examines the exception table: if it includes the
address of the instruction that triggered the exception, the error is
caused by a bad system call parameter; otherwise, it is caused by a
more serious bug.
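The lookup itself amounts to searching a table of (insn, fixup) pairs for the faulting address. Here is a self-contained sketch of that idea, assuming the two-field entry layout described below; the model_* names are hypothetical, and the real search_exception_tables( ) walks the main table plus each module's table (typically with a binary search over a sorted table).

```c
#include <stddef.h>

/* Sketch of an exception-table lookup over (insn, fixup) pairs.
 * All names are hypothetical; a linear scan keeps the sketch short,
 * whereas the kernel's tables are sorted and binary-searched. */
struct model_exception_entry {
    unsigned long insn;   /* address of an instruction that may fault */
    unsigned long fixup;  /* address of its fixup code */
};

static const struct model_exception_entry *
model_search_exception_table(const struct model_exception_entry *table,
                             size_t n, unsigned long addr)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (table[i].insn == addr)
            return &table[i];
    return NULL;   /* not a guarded access: a genuine kernel bug */
}
```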
Linux defines several exception tables. The main exception table is automatically generated
by the C compiler when building the kernel program image. It is stored
in the _ _ex_table section of the
kernel code segment, and its starting and ending addresses are
identified by two symbols produced by the C compiler: _ _start_ _ _ex_table and _ _stop_ _ _ex_table.
Moreover, each dynamically loaded module of the kernel (see Appendix B) includes its own local exception table. This table is automatically generated by the C compiler when building the module image, and it is loaded into memory when the module is inserted in the running kernel.
Each entry of an exception table is an exception_table_entry structure that has two
fields:
insn
The linear address of an instruction that accesses the process address space
fixup
The address of the assembly language code to be invoked
when a Page Fault exception triggered by the instruction located
at insn occurs
The fixup code consists of a few assembly language instructions
that solve the problem triggered by the exception. As we will see
later in this section, the fix usually consists of inserting a
sequence of instructions that forces the service routine to return an
error code to the User Mode process. These instructions, which are
usually defined in the same macro or function that accesses the
process address space, are placed by the C compiler into a separate
section of the kernel code segment called .fixup.
The search_exception_tables(
) function is used to search for a specified address in all
exception tables: if the address is included in a table, the function
returns a pointer to the corresponding exception_table_entry structure; otherwise,
it returns NULL. Thus the Page
Fault handler do_page_fault( )
executes the following statements:
if ((fixup = search_exception_tables(regs->eip))) {
    regs->eip = fixup->fixup;
    return 1;
}
The regs->eip field
contains the value of the eip
register saved on the Kernel Mode stack when the exception occurred.
If the value in the register (the instruction pointer) is in an
exception table, do_page_fault( )
replaces the saved value with the address found in the entry returned
by search_exception_tables( ). Then
the Page Fault handler terminates and the interrupted program resumes
with execution of the fixup code.
The GNU Assembler .section directive allows programmers to
specify which section of the executable file contains the code that
follows. As we will see in Chapter
20, an executable file includes a code segment, which in turn
may be subdivided into sections. Thus, the following assembly language
instructions add an entry into an exception table; the "a" attribute specifies that the section
must be loaded into memory together with the rest of the kernel
image:
.section _ _ex_table, "a"
    .long faulty_instruction_address, fixup_code_address
.previous
The .previous directive
forces the assembler to insert the code that follows into the section
that was active when the last .section directive was encountered.
Let's consider again the _ _get_user_1(
), _ _get_user_2( ), and
_ _get_user_4( ) functions
mentioned before. The instructions that access the process address
space are those labeled as 1,
2, and 3:
_ _get_user_1:
    [...]
1:  movzbl (%eax), %edx
    [...]
_ _get_user_2:
    [...]
2:  movzwl -1(%eax), %edx
    [...]
_ _get_user_4:
    [...]
3:  movl -3(%eax), %edx
    [...]
bad_get_user:
    xorl %edx, %edx
    movl $-EFAULT, %eax
    ret
.section _ _ex_table,"a"
    .long 1b, bad_get_user
    .long 2b, bad_get_user
    .long 3b, bad_get_user
.previous
Each exception table entry consists of two labels. The first one
is a numeric label with a b suffix
to indicate that the label is "backward;" in other words, it appears
in a previous line of the program. The fixup code is common to the
three functions and is labeled as bad_get_user. If a Page Fault exception is generated by the instructions at label
1, 2, or 3,
the fixup code is executed. It simply returns an -EFAULT error code to the process that
issued the system call.
Other kernel functions that act in the User Mode address space
use the fixup code technique. Consider, for instance, the strlen_user(string) macro. This macro
returns either the length of a null-terminated string passed as a
parameter in a system call or the value 0 on error. The macro
essentially yields the following assembly language
instructions:
    movl $0, %eax
    movl $0x7fffffff, %ecx
    movl %ecx, %ebx
    movl string, %edi
0:  repne; scasb
    subl %ecx, %ebx
    movl %ebx, %eax
1:
.section .fixup,"ax"
2:  xorl %eax, %eax
    jmp 1b
.previous
.section _ _ex_table,"a"
    .long 0b, 2b
.previous
The ecx and ebx registers are initialized with the
0x7fffffff value, which represents
the maximum allowed length for the string in the User Mode address
space. The repne;scasb assembly
language instructions iteratively scan the string pointed to by the
edi register, looking for the value
0 (the end of string \0 character)
in eax. Because scasb decreases the ecx register at each iteration, the eax register ultimately stores the total
number of bytes scanned in the string (that is, the length of the
string).
The fixup code of the macro is inserted into the .fixup section. The "ax" attributes specify that the section
must be loaded into memory and that it contains executable code. If a
Page Fault exception is generated by the instructions at label
0, the fixup code is executed; it
simply loads the value 0 in eax—thus forcing the macro to return a 0
error code instead of the string length—and then jumps to the 1 label, which corresponds to the
instruction following the macro.
The second .section directive
adds an entry containing the address of the repne; scasb instruction and the address of
the corresponding fixup code in the _
_ex_table section.
Although system calls are used mainly by User Mode
processes, they can also be invoked by kernel threads, which cannot use library functions. To simplify the declarations of the corresponding wrapper routines, Linux defines a set of seven macros called _syscall0 through _syscall6.
In the name of each macro, the numbers 0 through 6 correspond to the number of parameters used by the system call (excluding the system call number). The macros are used to declare wrapper routines that are not already included in the libc standard library (for instance, because the Linux system call is not yet supported by the library); however, they cannot be used to define wrapper routines for system calls that have more than six parameters (excluding the system call number) or for system calls that yield nonstandard return values.
Each macro requires exactly 2 + 2 × n
parameters, with n being the number of parameters
of the system call. The first two parameters specify the return type and
the name of the system call; each additional pair of parameters
specifies the type and the name of the corresponding system call
parameter. Thus, for instance, the wrapper routine of the fork( ) system call may be generated by:
_syscall0(int,fork)
while the wrapper routine of the write(
) system call may be generated by:
_syscall3(int,write,int,fd,const char *,buf,unsigned int,count)
In the latter case, the macro yields the following code:
int write(int fd, const char * buf, unsigned int count)
{
    long _ _res;
    asm("int $0x80"
        : "=a" (_ _res)
        : "0" (_ _NR_write), "b" ((long)fd),
          "c" ((long)buf), "d" ((long)count));
    if ((unsigned long)_ _res >= (unsigned long)-129) {
        errno = -_ _res;
        _ _res = -1;
    }
    return (int) _ _res;
}
The _ _NR_write macro is
derived from the second parameter of _syscall3; it expands into the system call
number of write( ). When compiling
the preceding function, the following assembly language code is
produced:
write:
    pushl %ebx           ; push ebx into stack
    movl 8(%esp), %ebx   ; put first parameter in ebx
    movl 12(%esp), %ecx  ; put second parameter in ecx
    movl 16(%esp), %edx  ; put third parameter in edx
    movl $4, %eax        ; put _ _NR_write in eax
    int $0x80            ; invoke system call
    cmpl $-125, %eax     ; check return code
    jbe .L1              ; if no error, jump
    negl %eax            ; complement the value of eax
    movl %eax, errno     ; put result in errno
    movl $-1, %eax       ; set eax to -1
.L1: popl %ebx           ; pop ebx from stack
    ret                  ; return to calling program
Notice how the parameters of the write(
) function are loaded into the CPU registers before the
int $0x80 instruction is executed.
The value returned in eax must be
interpreted as an error code if it lies between -1 and -129 (the kernel
assumes that the largest error code defined in
include/generic/errno.h is 129). If this is the
case, the wrapper routine stores the value of -eax in errno and returns the value -1; otherwise, it
returns the value of eax.
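The error-code convention applied by the wrapper can be captured in a few lines of portable C. This is a model of the check shown above, assuming 129 as the largest error code; model_syscall_return and model_errno are hypothetical names standing in for the wrapper logic and the libc errno variable.

```c
/* Model of the _syscallN wrapper convention: a raw return value in
 * the range -1..-129 is an error code, negated into errno, and the
 * wrapper returns -1; anything else is passed through unchanged. */
static int model_errno;

static long model_syscall_return(long res)
{
    if ((unsigned long)res >= (unsigned long)-129) {
        model_errno = (int)(-res);
        return -1;
    }
    return res;
}
```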
Signals were introduced by the first Unix systems to allow interactions between User Mode processes; the kernel also uses them to notify processes of system events. Signals have been around for 30 years with only minor changes.
The first sections of this chapter examine in detail how signals are handled by the Linux kernel; then we discuss the system calls that allow processes to exchange signals.
A signal is a very short message that may be sent to a process or a group of processes. The only information given to the process is usually a number identifying the signal; there is no room in standard signals for arguments, a message, or other accompanying information.
A set of macros whose names start with the prefix SIG is used to identify signals; we have
already made a few references to them in previous chapters. For
instance, the SIGCHLD macro was
mentioned in the section "The clone( ), fork( ), and vfork(
) System Calls" in Chapter
3. This macro, which expands into the value 17 in Linux, yields
the identifier of the signal that is sent to a parent process when a
child stops or terminates. The SIGSEGV macro, which expands into the value
11, was mentioned in the section "Page Fault Exception Handler"
in Chapter 9; it yields the
identifier of the signal that is sent to a process when it makes an
invalid memory reference.
Signals serve two main purposes:
To make a process aware that a specific event has occurred
To cause a process to execute a signal handler function included in its code
Of course, the two purposes are not mutually exclusive, because often a process must react to some event by executing a specific routine.
Table 11-1
lists the first 31 signals handled by Linux 2.6 for the 80×86
architecture (some signal numbers, such as those associated with SIGCHLD or SIGSTOP, are architecture-dependent;
furthermore, some signals such as SIGSTKFLT are defined only for specific
architectures). The meanings of the default actions are described in the
next section.
Table 11-1. The first 31 signals in Linux/i386
| # | Signal name | Default action | Comment | POSIX |
|---|---|---|---|---|
| 1 | SIGHUP | Terminate | Hang up controlling terminal or process | Yes |
| 2 | SIGINT | Terminate | Interrupt from keyboard | Yes |
| 3 | SIGQUIT | Dump | Quit from keyboard | Yes |
| 4 | SIGILL | Dump | Illegal instruction | Yes |
| 5 | SIGTRAP | Dump | Breakpoint for debugging | No |
| 6 | SIGABRT | Dump | Abnormal termination | Yes |
| 6 | SIGIOT | Dump | Equivalent to SIGABRT | No |
| 7 | SIGBUS | Dump | Bus error | No |
| 8 | SIGFPE | Dump | Floating-point exception | Yes |
| 9 | SIGKILL | Terminate | Forced-process termination | Yes |
| 10 | SIGUSR1 | Terminate | Available to processes | Yes |
| 11 | SIGSEGV | Dump | Invalid memory reference | Yes |
| 12 | SIGUSR2 | Terminate | Available to processes | Yes |
| 13 | SIGPIPE | Terminate | Write to pipe with no readers | Yes |
| 14 | SIGALRM | Terminate | Real-timer clock | Yes |
| 15 | SIGTERM | Terminate | Process termination | Yes |
| 16 | SIGSTKFLT | Terminate | Coprocessor stack error | No |
| 17 | SIGCHLD | Ignore | Child process stopped or terminated, or got signal if traced | Yes |
| 18 | SIGCONT | Continue | Resume execution, if stopped | Yes |
| 19 | SIGSTOP | Stop | Stop process execution | Yes |
| 20 | SIGTSTP | Stop | Stop process issued from tty | Yes |
| 21 | SIGTTIN | Stop | Background process requires input | Yes |
| 22 | SIGTTOU | Stop | Background process requires output | Yes |
| 23 | SIGURG | Ignore | Urgent condition on socket | No |
| 24 | SIGXCPU | Dump | CPU time limit exceeded | No |
| 25 | SIGXFSZ | Dump | File size limit exceeded | No |
| 26 | SIGVTALRM | Terminate | Virtual timer clock | No |
| 27 | SIGPROF | Terminate | Profile timer clock | No |
| 28 | SIGWINCH | Ignore | Window resizing | No |
| 29 | SIGIO | Terminate | I/O now possible | No |
| 29 | SIGPOLL | Terminate | Equivalent to SIGIO | No |
| 30 | SIGPWR | Terminate | Power supply failure | No |
| 31 | SIGSYS | Dump | Bad system call | No |
| 31 | SIGUNUSED | Dump | Equivalent to SIGSYS | No |
Besides the regular signals described in this table, the POSIX standard has introduced a new class of signals denoted as real-time signals; their signal numbers range from 32 to 64 on Linux. They mainly differ from regular signals because they are always queued so that multiple signals sent will be received. On the other hand, regular signals of the same kind are not queued: if a regular signal is sent many times in a row, just one of them is delivered to the receiving process. Although the Linux kernel does not use real-time signals, it fully supports the POSIX standard by means of several specific system calls.
A number of system calls allow programmers to send signals and determine how their processes respond to the signals they receive. Table 11-2 summarizes these calls; their behavior is described in detail in the later section "System Calls Related to Signal Handling."
Table 11-2. The most significant system calls related to signals
| System call | Description |
|---|---|
| kill( ) | Send a signal to a thread group |
| tkill( ) | Send a signal to a process |
| tgkill( ) | Send a signal to a process in a specific thread group |
| signal( ) | Change the action associated with a signal |
| sigaction( ) | Similar to signal( ) |
| sigpending( ) | Check whether there are pending signals |
| sigprocmask( ) | Modify the set of blocked signals |
| sigsuspend( ) | Wait for a signal |
| rt_sigaction( ) | Change the action associated with a real-time signal |
| rt_sigpending( ) | Check whether there are pending real-time signals |
| rt_sigprocmask( ) | Modify the set of blocked real-time signals |
| rt_sigqueueinfo( ) | Send a real-time signal to a thread group |
| rt_sigtimedwait( ) | Wait for a real-time signal |
| rt_sigsuspend( ) | Similar to sigsuspend( ) |
An important characteristic of signals is that they may be sent at any time to a process whose state is usually unpredictable. Signals sent to a process that is not currently executing must be saved by the kernel until that process resumes execution. Blocking a signal (described later) requires that delivery of the signal be held off until it is later unblocked, which exacerbates the problem of signals being raised before they can be delivered.
Therefore, the kernel distinguishes two different phases related to signal transmission:
The kernel updates a data structure of the destination process to represent that a new signal has been sent.
The kernel forces the destination process to react to the signal by changing its execution state, by starting the execution of a specified signal handler, or both.
Each signal generated can be delivered once, at most. Signals are consumable resources: once they have been delivered, all process descriptor information that refers to their previous existence is canceled.
Signals that have been generated but not yet delivered are called pending signals. At any time, only one pending signal of a given type may exist for a process; additional pending signals of the same type to the same process are not queued but simply discarded. Real-time signals are different, though: there can be several pending signals of the same type.
In general, a signal may remain pending for an unpredictable amount of time. The following factors must be taken into consideration:
Signals are usually delivered only to the currently running
process (that is, to the current
process).
Signals of a given type may be selectively blocked by a process (see the later section "Modifying the Set of Blocked Signals"). In this case, the process does not receive the signal until it removes the block.
When a process executes a signal-handler function, it usually masks the corresponding signal—i.e., it automatically blocks the signal until the handler terminates. A signal handler therefore cannot be interrupted by another occurrence of the handled signal, and the function doesn't need to be reentrant.
Although the notion of signals is intuitive, the kernel implementation is rather complex. The kernel must:
Remember which signals are blocked by each process.
When switching from Kernel Mode to User Mode, check whether a signal for a process has arrived. This happens at almost every timer interrupt (roughly every millisecond).
Determine whether the signal can be ignored. This happens when all of the following conditions are fulfilled:
The destination process is not traced by another process
(the PT_PTRACED flag in the
process descriptor ptrace
field is equal to 0).[*]
The signal is not blocked by the destination process.
The signal is being ignored by the destination process (either because the process explicitly ignored it or because the process did not change the default action of the signal and that action is "ignore").
Handle the signal, which may require switching the process to a handler function at any point during its execution and restoring the original execution context after the function returns.
Moreover, Linux must take into account the different semantics for signals adopted by BSD and System V; furthermore, it must comply with the rather cumbersome POSIX requirements.
There are three ways in which a process can respond to a signal:
Explicitly ignore the signal.
Execute the default action associated with the signal (see Table 11-1). This action, which is predefined by the kernel, depends on the signal type and may be any one of the following:
The process is terminated (killed).
The process is terminated (killed) and a core file containing its execution
context is created, if possible; this file may be used for
debug purposes.
The signal is ignored.
The process is stopped—i.e., put in the TASK_STOPPED state (see the
section "Process
State" in Chapter
3).
If the process was stopped (TASK_STOPPED), it is put into the
TASK_RUNNING
state.
Catch the signal by invoking a corresponding signal-handler function.
Notice that blocking a signal is different from ignoring it. A signal is not delivered as long as it is blocked; it is delivered only after it has been unblocked. An ignored signal is always delivered, and there is no further action.
The SIGKILL and SIGSTOP signals cannot be ignored, caught,
or blocked, and their default actions must always be executed. Therefore, SIGKILL and SIGSTOP allow a user with appropriate
privileges to terminate and to stop, respectively, every
process,[*] regardless of the defenses taken by the program it is
executing.
A signal is fatal for a given process if
delivering the signal causes the kernel to kill the process. The
SIGKILL signal is always fatal;
moreover, each signal whose default action is "Terminate" and which is
not caught by a process is also fatal for that process. Notice,
however, that a signal caught by a process and whose corresponding
signal-handler function terminates the process is not fatal, because
the process chose to terminate itself rather than being killed by the
kernel.
The POSIX 1003.1 standard has some stringent requirements for signal handling of multithreaded applications:
Signal handlers must be shared among all threads of a multithreaded application; however, each thread must have its own mask of pending and blocked signals.
The kill( ) and sigqueue( )
POSIX library functions (see the later section
"System Calls Related
to Signal Handling") must send signals to whole
multithreaded applications, not to a specific thread. The same
holds for all signals (such as SIGCHLD, SIGINT, or SIGQUIT) generated by the kernel.
Each signal sent to a multithreaded application will be delivered to just one thread, which is arbitrarily chosen by the kernel among the threads that are not blocking that signal.
If a fatal signal is sent to a multithreaded application, the kernel will kill all threads of the application—not just the thread to which the signal has been delivered.
In order to comply with the POSIX standard, the Linux 2.6 kernel implements a multithreaded application as a set of lightweight processes belonging to the same thread group (see the section "Processes, Lightweight Processes, and Threads" in Chapter 3).
In this chapter the term "thread group" denotes any thread
group, even if it is composed of a single (conventional) process. For
instance, when we state that kill(
) can send a signal to a thread group, we imply that this
system call can send a signal to a conventional process, too. We will
use the term "process" to denote either a conventional process or a
lightweight process—that is, a specific member of a thread
group.
Furthermore, a pending signal is private if it has been sent to a specific process; it is shared if it has been sent to a whole thread group.
For each process in the system, the kernel must keep track of what signals are currently pending or masked; the kernel must also keep track of how every thread group is supposed to handle every signal. To do this, the kernel uses several data structures accessible from the process descriptor. The most significant ones are shown in Figure 11-1.
The fields of the process descriptor related to signal handling are listed in Table 11-3.
Table 11-3. Process descriptor fields related to signal handling
The blocked field stores the
signals currently masked out by the process. It is a sigset_t array of bits, one for each signal
type:
typedef struct {
    unsigned long sig[2];
} sigset_t;
Because each unsigned long
number consists of 32 bits, the maximum number of signals that may be
declared in Linux is 64 (the _NSIG
macro specifies this value). No signal can have number 0, so the
signal number corresponds to the index of the corresponding bit in a
sigset_t variable plus one. Numbers
between 1 and 31 correspond to the signals listed in Table 11-1, while numbers
between 32 and 64 correspond to real-time signals.
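This numbering rule can be sketched in ordinary C (the ksigset_t type and the helper names below are ours, modeled on the structure just shown): signal 1 maps to bit 0 of the first word, signal 33 to bit 0 of the second.

```c
#include <stdint.h>

/* Model of the kernel's sigset_t: 64 signals in two 32-bit words. */
typedef struct { uint32_t sig[2]; } ksigset_t;

/* Signal nsig (1..64) maps to bit (nsig-1) % 32 of word (nsig-1) / 32. */
int sig_word(int nsig) { return (nsig - 1) / 32; }
int sig_bit(int nsig)  { return (nsig - 1) % 32; }

/* Test the bit corresponding to signal nsig. */
int ksigismember(const ksigset_t *set, int nsig)
{
    return 1 & (set->sig[sig_word(nsig)] >> sig_bit(nsig));
}

/* Set the bit corresponding to signal nsig. */
void ksigaddset(ksigset_t *set, int nsig)
{
    set->sig[sig_word(nsig)] |= (uint32_t)1 << sig_bit(nsig);
}
```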
The signal field of
the process descriptor points to a signal
descriptor, a signal_struct structure that keeps track
of the shared pending signals. Actually, the signal descriptor also
includes fields not strictly related to signal handling, such as the
rlim per-process resource limit
array (see the section "Process Resource
Limits" in Chapter
3), or the pgrp and
session fields, which store the
PIDs of the group leader and of the session leader of the process,
respectively (see the section "Relationships Among
Processes" in Chapter
3). In fact, as mentioned in the section "The clone(
) , fork( ), and vfork( ) System Calls" in Chapter 3, the signal descriptor
is shared by all processes belonging to the same thread group—that
is, all processes created by invoking the clone( ) system call with the CLONE_THREAD flag set—thus the signal
descriptor includes the fields that must be identical for every
process in the same thread group.
The fields of a signal descriptor somewhat related to signal handling are shown in Table 11-4.
Table 11-4. The fields of the signal descriptor related to signal handling
Besides the signal descriptor, every process refers also to a
signal handler descriptor, which is a sighand_struct structure describing how
each signal must be handled by the thread group. Its fields are
shown in Table
11-5.
Table 11-5. The fields of the signal handler descriptor
Type | Name | Description |
|---|---|---|
| atomic_t | count | Usage counter of the signal handler descriptor |
| struct k_sigaction [64] | action | Array of structures specifying the actions to be performed upon delivering the signals |
| spinlock_t | siglock | Spin lock protecting both the signal descriptor and the signal handler descriptor |
As mentioned in the section "The clone( ), fork( ), and
vfork( ) System Calls" in Chapter 3, the signal handler
descriptor may be shared by several processes by invoking the
clone( ) system call with the
CLONE_SIGHAND flag set; the
count field in this descriptor
specifies the number of processes that share the structure. In a
POSIX multithreaded application, all lightweight processes in the
thread group refer to the same signal descriptor and to the same
signal handler descriptor.
Some architectures assign properties to a signal that
are visible only to the kernel. Thus, the properties of a signal are
stored in a k_sigaction
structure, which contains both the properties hidden from the User
Mode process and the more familiar sigaction structure that holds all the
properties a User Mode process can see. Actually, on the 80 × 86
platform, all signal properties are visible to User Mode processes.
Thus the k_sigaction structure
simply reduces to a single sa
structure of type sigaction,
which includes the following fields:[*]
sa_handler
This field specifies the type of action to be performed;
its value can be a pointer to the signal handler, SIG_DFL (that is, the value 0) to
specify that the default action is performed, or SIG_IGN (that is, the value 1) to
specify that the signal is ignored.
sa_flags
This set of flags specifies how the signal must be handled; some of them are listed in Table 11-6.[†]
sa_mask
This sigset_t
variable specifies the signals to be masked when running the
signal handler.
Table 11-6. Flags specifying how to handle a signal
Flag Name | Description |
|---|---|
| SA_NOCLDSTOP | Applies only to SIGCHLD; do not send SIGCHLD to the parent when the process is stopped |
| SA_NOCLDWAIT | Applies only to SIGCHLD; do not create a zombie when the process terminates |
| SA_SIGINFO | Provide additional information to the signal handler (see the later section "Changing a Signal Action") |
| SA_ONSTACK | Use an alternative stack for the signal handler (see the later section "Catching the Signal") |
| SA_RESTART | Interrupted system calls are automatically restarted (see the later section "Reexecution of System Calls") |
| SA_NODEFER | Do not mask the signal while executing the signal handler |
| SA_RESETHAND | Reset to default action after executing the signal handler |
As we have seen in Table 11-2 earlier in
the chapter, there are several system calls that can generate a
signal: some of them—kill( )
and rt_sigqueueinfo(
) —send a signal to a whole thread group, while
others—tkill( ) and tgkill( )
—send a signal to a specific process.
Thus, in order to keep track of what signals are currently pending, the kernel associates two pending signal queues to each process:
The shared pending signal queue,
rooted at the shared_pending
field of the signal descriptor, stores the pending signals of
the whole thread group.
The private pending signal queue,
rooted at the pending field
of the process descriptor, stores the pending signals of the
specific (lightweight) process.
A pending signal queue consists of a sigpending data structure, which is
defined as follows:
struct sigpending {
    struct list_head list;
    sigset_t signal;
}
The signal field is a bit
mask specifying the pending signals, while the list field is the head of a doubly linked
list containing sigqueue data
structures; the fields of this structure are shown in Table 11-7.
Table 11-7. The fields of the sigqueue data structure
Type | Name | Description |
|---|---|---|
| struct list_head | list | Links for the pending signal queue's list |
| spinlock_t * | lock | Pointer to the siglock field protecting the corresponding pending signal queue |
| int | flags | Flags of the sigqueue data structure |
| siginfo_t | info | Describes the event that raised the signal |
| struct user_struct * | user | Pointer to the per-user data structure of the process's owner (see the section "The clone( ), fork( ), and vfork( ) System Calls" in Chapter 3) |
The siginfo_t data
structure is a 128-byte data structure that stores information about
an occurrence of a specific signal; it includes the following
fields:
si_signo
The signal number
si_errno
The error code of the instruction that caused the signal to be raised, or 0 if there was no error
si_code
A code identifying who raised the signal (see Table 11-8)
Table 11-8. The most significant signal sender codes
Code Name | Sender |
|---|---|
| SI_USER | kill( ) and raise( ) |
| SI_KERNEL | Generic kernel function |
| SI_QUEUE | sigqueue( ) |
| SI_TIMER | Timer expiration |
| SI_ASYNCIO | Asynchronous I/O completion |
| SI_TKILL | tkill( ) and tgkill( ) |
_sifields
A union storing information depending on the type of
signal. For instance, the siginfo_t data structure relative to
an occurrence of the SIGKILL signal records the PID and
the UID of the sender process here; conversely, the
data structure relative to an occurrence of the SIGSEGV signal stores the memory
address whose access caused the signal to be raised.
Several functions and macros are used by the kernel to
handle signals. In the following description, set is a pointer to a sigset_t variable, nsig is the number of a signal, and mask is an unsigned
long bit mask.
sigemptyset(set) and sigfillset(set)
Sets the bits in the sigset_t variable to 0 or 1,
respectively.
sigaddset(set,nsig) and sigdelset(set,nsig)
Sets the bit of the sigset_t variable corresponding to
signal nsig to 1 or 0,
respectively. In practice, sigaddset(
) reduces to:
set->sig[(nsig - 1) / 32] |= 1UL << ((nsig - 1) % 32);
and sigdelset( )
to:
set->sig[(nsig - 1) / 32] &= ~(1UL << ((nsig - 1) % 32));
sigaddsetmask(set,mask) and sigdelsetmask(set,mask)
Sets all the bits of the sigset_t variable whose corresponding
bits of mask are on 1 or 0,
respectively. They can be used only with signals that are
between 1 and 32. The corresponding functions reduce to:
set->sig[0] |= mask;
and to:
set->sig[0] &= ~mask;
sigismember(set,nsig)
Returns the value of the bit of the sigset_t variable corresponding to the
signal nsig. In practice,
this function reduces to:
return 1 & (set->sig[(nsig-1) / 32] >> ((nsig-1) % 32));
sigmask(nsig)
Yields the bit index of the signal nsig. In other words, if the kernel
needs to set, clear, or test a bit in an element of sigset_t that corresponds to a
particular signal, it can derive the proper bit through this
macro.
sigandsets(d,s1,s2), sigorsets(d,s1,s2), and signandsets(d,s1,s2)
Performs a logical AND, a logical OR, and a logical NAND,
respectively, between the sigset_t variables to which s1 and s2 point; the result is stored in the
sigset_t variable to which
d points.
sigtestsetmask(set,mask)
Returns the value 1 if any of the bits in the sigset_t variable that correspond to
the bits set to 1 in mask is
set; it returns 0 otherwise. It can be used only with signals
that have a number between 1 and 32.
siginitset(set,mask)
Initializes the low bits of the sigset_t variable corresponding to
signals between 1 and 32 with the bits contained in mask, and clears the bits
corresponding to signals between 33 and 63.
siginitsetinv(set,mask)
Initializes the low bits of the sigset_t variable corresponding to
signals between 1 and 32 with the complement of the bits
contained in mask, and sets
the bits corresponding to signals between 33 and 63.
signal_pending(p)
Returns the value 1 (true) if the process identified by
the *p process descriptor has
nonblocked pending signals, and returns the value 0 (false) if
it doesn't. The function is implemented as a simple check on the
TIF_SIGPENDING flag of the
process.
recalc_sigpending_tsk(t) and recalc_sigpending( )
The first function checks whether there are pending
signals either for the process identified by the process
descriptor at *t (by looking
at the t->pending->signal field) or for
the thread group to which the process belongs (by looking at the
t->signal->shared_pending->signal
field). The function then sets accordingly the TIF_SIGPENDING flag in t->thread_info->flags. The
recalc_sigpending( ) function
is equivalent to recalc_sigpending_tsk(current).
rm_from_queue(mask,q)
Removes from the pending signal queue q the pending signals corresponding to
the bit mask mask.
flush_sigqueue(q)
Removes from the pending signal queue q all pending signals.
flush_signals(t)
Deletes all signals sent to the process identified by the
process descriptor at *t.
This is done by clearing the TIF_SIGPENDING flag in t->thread_info->flags and
invoking flush_sigqueue( ) twice, on the t->pending and t->signal->shared_pending queues.
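Two of the initializers above, siginitset( ) and siginitsetinv( ), can be modeled with 32-bit words (a user-space sketch; the ksigset2_t type and the function names are ours):

```c
#include <stdint.h>

/* Two 32-bit words, mirroring sigset_t on a 32-bit platform. */
typedef struct { uint32_t sig[2]; } ksigset2_t;

/* siginitset: low word taken from mask; bits for signals 33..64 cleared. */
void ksiginitset(ksigset2_t *set, uint32_t mask)
{
    set->sig[0] = mask;
    set->sig[1] = 0;
}

/* siginitsetinv: low word is the complement of mask; high bits all set. */
void ksiginitsetinv(ksigset2_t *set, uint32_t mask)
{
    set->sig[0] = ~mask;
    set->sig[1] = ~(uint32_t)0;
}
```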
[*] If a process receives a signal while it is being
traced, the kernel stops the process and notifies the
tracing process by sending a SIGCHLD signal to it. The tracing
process may, in turn, resume execution of the traced process
by means of a SIGCONT
signal.
[*] There are two exceptions: it is not possible to send a signal to process 0 (swapper), and signals sent to process 1 (init) are always discarded unless they are caught. Therefore, process 0 never dies, while process 1 dies only when the init program terminates.
[*] The sigaction structure
used by User Mode applications to pass parameters to the
signal( ) and sigaction(
) system calls is slightly different from the
structure used by the kernel, although it stores essentially the
same information.
Many kernel functions generate signals: they accomplish the first phase of signal handling—described earlier in the section "The Role of Signals"—by updating one or more process descriptors as needed. They do not directly perform the second phase of delivering the signal but, depending on the type of signal and the state of the destination processes, may wake up some processes and force them to receive the signal.
When a signal is sent to a process, either from the kernel or from another process, the kernel generates it by invoking one of the functions listed in Table 11-9.
Table 11-9. Kernel functions that generate a signal for a process
Name | Description |
|---|---|
| send_sig( ) | Sends a signal to a single process |
| send_sig_info( ) | Like send_sig( ), with extended information in a siginfo_t structure |
| force_sig( ) | Sends a signal that cannot be explicitly ignored or blocked by the process |
| force_sig_info( ) | Like force_sig( ), with extended information in a siginfo_t structure |
| force_sig_specific( ) | Like force_sig( ), but optimized for the SIGSTOP and SIGKILL signals |
| sys_tkill( ) | System call handler of tkill( ) |
| sys_tgkill( ) | System call handler of tgkill( ) |
All functions in Table
11-9 end up invoking the specific_send_sig_info( ) function described
in the next section.
When a signal is sent to a whole thread group, either from the kernel or from another process, the kernel generates it by invoking one of the functions listed in Table 11-10.
Table 11-10. Kernel functions that generate a signal for a thread group
Name | Description |
|---|---|
| group_send_sig_info( ) | Sends a signal to a single thread group identified by the process descriptor of one of its members |
| kill_pg( ) | Sends a signal to all thread groups in a process group (see the section "Process Management" in Chapter 1) |
| kill_pg_info( ) | Like kill_pg( ), with extended information in a siginfo_t structure |
| kill_proc( ) | Sends a signal to a single thread group identified by the PID of one of its members |
| kill_proc_info( ) | Like kill_proc( ), with extended information in a siginfo_t structure |
| sys_kill( ) | System call handler of kill( ) |
| sys_rt_sigqueueinfo( ) | System call handler of rt_sigqueueinfo( ) |
All functions in Table 11-10 end up
invoking the group_send_sig_info( )
function, which is described in the later section "The group_send_sig_info( )
Function."
The specific_send_sig_info(
) function sends a signal to a specific process. It acts on
three parameters:
sig
The signal number.
info
Either the address of a siginfo_t table or one of three
special values: 0 means that the signal has been sent by a User
Mode process, 1 means that it has been sent by the kernel, and 2
means that it has been sent by the kernel and the signal is
SIGSTOP or SIGKILL.
t
A pointer to the descriptor of the destination process.
The specific_send_sig_info( )
function must be invoked with local interrupts disabled and the
t->sighand->siglock spin lock
already acquired. The function executes the following steps:
Checks whether the process ignores the signal; in the affirmative case, returns 0 (signal not generated). The signal is ignored when all three conditions for ignoring a signal are satisfied, that is:
The process is not being traced (PT_PTRACED flag in t->ptrace clear).
The signal is not blocked (sigismember(&t->blocked, sig)
returns 0).
The signal is either explicitly ignored (the sa_handler field of t->sighand->action[sig-1] is
equal to SIG_IGN) or
implicitly ignored (the sa_handler field is equal to
SIG_DFL and the signal is
SIGCONT, SIGCHLD, SIGWINCH, or SIGURG).
Checks whether the signal is non-real-time (sig<32) and another occurrence of the
same signal is already pending in the private pending signal queue
of the process (sigismember(&t->pending.signal,sig)
returns 1): in the affirmative case, nothing has to be done, thus
returns 0.
Invokes send_signal(sig, info, t,
&t->pending) to add the signal to the set of
pending signals of the process; this function is described in
detail in the next section.
If send_signal( )
successfully terminated and the signal is not blocked (sigismember(&t->blocked,sig)
returns 0), invokes the signal_wake_up(
) function to notify the process about the new pending
signal. In turn, this function executes the following
steps:
Sets the TIF_SIGPENDING flags in t->thread_info->flags.
Invokes try_to_wake_up(
)—see the section "The try_to_wake_up( )
Function" in Chapter
7—to awake the process if it is either in TASK_INTERRUPTIBLE state, or in
TASK_STOPPED state and the
signal is SIGKILL.
If try_to_wake_up( )
returned 0, the process was already runnable: if so, it checks
whether the process is already running on another CPU and, in
this case, sends an interprocessor interrupt to that CPU to
force a reschedule of the current process (see the section
"Interprocessor
Interrupt Handling" in Chapter 4). Because each
process checks the existence of pending signals when returning
from the schedule( )
function, the interprocessor interrupt ensures that the
destination process quickly notices the new pending
signal.
Returns 1 (the signal has been successfully generated).
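Step 1 above can be modeled as a standalone predicate (a deliberately simplified sketch with a made-up task_model structure, not the kernel's code): the signal is dropped only when the process is not traced, is not blocking it, and is ignoring it, either explicitly or by default.

```c
#include <signal.h>

/* Simplified stand-ins for the process descriptor fields involved. */
struct task_model {
    int traced;                   /* PT_PTRACED flag set?          */
    int blocked;                  /* is the signal in t->blocked?  */
    void (*handler)(int);         /* sa_handler for this signal    */
};

/* Signals whose default action is "ignore". */
static int default_is_ignore(int sig)
{
    return sig == SIGCONT || sig == SIGCHLD ||
           sig == SIGWINCH || sig == SIGURG;
}

/* Mirrors the ignore-check of specific_send_sig_info( ). */
int sig_ignored_model(const struct task_model *t, int sig)
{
    if (t->traced || t->blocked)
        return 0;
    return t->handler == SIG_IGN ||
           (t->handler == SIG_DFL && default_is_ignore(sig));
}
```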
The send_signal( )
function inserts a new item in a pending signal queue. It receives as
its parameters the signal number sig, the address info of a siginfo_t data structure (or a special code,
see the description of specific_send_sig_info( ) in the previous
section), the address t of the
descriptor of the target process, and the address signals of the pending signal queue.
The function executes the following steps:
If the value of info is
2, the signal is either SIGKILL
or SIGSTOP and it has been
generated by the kernel via the force_sig_specific( ) function: in this
case, it jumps to step 9. The action corresponding to these
signals is immediately enforced by the kernel, thus the function
may skip adding the signal to the pending signal queue.
If the number of pending signals of the process's owner
(t->user->sigpending) is
smaller than the current process's resource limit (t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur),
the function allocates a sigqueue data structure for the new
occurrence of the signal:
q = kmem_cache_alloc(sigqueue_cachep, GFP_ATOMIC);
If the number of pending signals of the process's owner is too high or the memory allocation in the previous step failed, it jumps to step 9.
Increases the number of pending signals of the owner
(t->user->sigpending) and
the reference counter of the per-user data structure pointed to by
t->user.
Adds the sigqueue data
structure in the pending signal queue signals:
list_add_tail(&q->list, &signals->list);
Fills the siginfo_t table
inside the sigqueue data
structure:
if ((unsigned long)info == 0) {
q->info.si_signo = sig;
q->info.si_errno = 0;
q->info.si_code = SI_USER;
q->info._sifields._kill._pid = current->pid;
q->info._sifields._kill._uid = current->uid;
} else if ((unsigned long)info == 1) {
q->info.si_signo = sig;
q->info.si_errno = 0;
q->info.si_code = SI_KERNEL;
q->info._sifields._kill._pid = 0;
q->info._sifields._kill._uid = 0;
} else
copy_siginfo(&q->info, info);

The copy_siginfo( )
function copies the siginfo_t
table passed by the caller.
Sets the bit corresponding to the signal in the bit mask of the queue:
sigaddset(&signals->signal, sig);
Returns 0: the signal has been successfully appended to the pending signal queue.
Here, an item will not be added to the signal pending queue,
because there are already too many pending signals, or there is no
free memory for the sigqueue
data structure, or the signal is immediately enforced by the
kernel. If the signal is real-time and was sent through a kernel
function that is explicitly required to queue it, the function
returns the error code -EAGAIN:
if (sig>=32 && info && (unsigned long) info != 1 &&
info->si_code != SI_USER)
return -EAGAIN;
Sets the bit corresponding to the signal in the bit mask of the queue:
sigaddset(&signals->signal, sig);
Returns 0: even if the signal has not been appended to the queue, the corresponding bit has been set in the bit mask of pending signals.
It is important to let the destination process receive the
signal even if there is no room for the corresponding item in the
pending signal queue. Suppose, for instance, that a process is
consuming too much memory. The kernel must ensure that the kill( ) system call succeeds even if there is no free memory;
otherwise, the system administrator doesn't have any chance to recover
the system by terminating the offending process.
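This degraded-but-safe behavior of send_signal( ) can be modeled as follows (a deliberately simplified sketch; the pending_model structure and the error value are stand-ins for the kernel's data structures): when no sigqueue item can be allocated, a conventional signal is still recorded in the bit mask, while a real-time signal that must be queued fails instead.

```c
#include <stdint.h>

/* Minimal model of a pending signal queue: a 64-bit mask plus a count
 * standing in for the list of sigqueue items. */
struct pending_model {
    uint64_t mask;   /* like the signal bit mask of sigpending   */
    int queued;      /* items actually linked on the list        */
    int limit;       /* stand-in for RLIMIT_SIGPENDING           */
};

/* Mirrors the queue-full path of send_signal( ). */
int send_signal_model(struct pending_model *q, int sig)
{
    if (q->queued < q->limit)
        q->queued++;                       /* sigqueue item "allocated" */
    else if (sig >= 32)
        return -11;                        /* -EAGAIN: must be queued   */
    q->mask |= (uint64_t)1 << (sig - 1);   /* sigaddset(..., sig)       */
    return 0;
}
```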
The group_send_sig_info(
) function sends a signal to a whole thread group. It acts
on three parameters: a signal number sig, the address info of a siginfo_t table—or alternatively the special
values 0, 1, or 2, as explained in the earlier section "The specific_send_sig_info( )
Function"—and the address p
of a process descriptor.
The function essentially executes the following steps:
Checks that the parameter sig is correct:
if (sig < 0 || sig > 64)
return -EINVAL;
If the signal is being sent by a User Mode process, it checks whether the operation is allowed. The signal is delivered only if at least one of the following conditions holds:
The owner of the sending process has the proper capability (usually, this simply means the signal was issued by the system administrator; see Chapter 20).
The signal is SIGCONT and the destination process is in the same login session as the sending process.
Both processes belong to the same user.
If the User Mode process is not allowed to send the signal,
the function returns the value -EPERM.
If the sig parameter has the value 0, it returns immediately without generating any signal:
if (!sig || !p->sighand)
    return 0;
Because 0 is not a valid signal number, it is used to allow the sending process to check whether it has the required privileges to send a signal to the destination thread group. The function also returns if the destination process is being killed, indicated by checking whether its signal handler descriptor has been released.
Acquires the p->sighand->siglock spin lock and
disables local interrupts.
Invokes the handle_stop_signal(
) function, which checks for some types of signals that
might nullify other pending signals for the destination thread
group. The latter function executes the following steps:
If the thread group is being killed (SIGNAL_GROUP_EXIT flag in the
flags field of the signal
descriptor set), it returns.
If sig is a SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal, the function invokes
the rm_from_queue( )
function to remove the SIGCONT signal from the shared
pending signal queue p->signal->shared_pending and
from the private queues of all members of the thread
group.
If sig is SIGCONT, it invokes the rm_from_queue( ) function to remove
any SIGSTOP, SIGTSTP, SIGTTIN, and SIGTTOU signal from the shared
pending signal queue p->signal->shared_pending;
then, removes the same signals from the private pending signal
queues of the processes belonging to the thread group, and
awakens them:
rm_from_queue(0x003c0000, &p->signal->shared_pending);
t = p;
do {
    rm_from_queue(0x003c0000, &t->pending);
    try_to_wake_up(t, TASK_STOPPED, 0);
    t = next_thread(t);
} while (t != p);
The mask 0x003c0000 selects the four stop signals. At each iteration, the next_thread macro returns the descriptor address of a different lightweight process in the thread group (see the section "Relationships Among Processes" in Chapter 3).[*]
Checks whether the thread group ignores the signal; if so, returns the value 0 (success). The signal is ignored when all three conditions for ignoring a signal that are mentioned in the earlier section "The Role of Signals" are satisfied (see also step 1 in the earlier section "The specific_send_sig_info( ) Function").
Checks whether the signal is non-real-time and another occurrence of the same signal is already pending in the shared pending signal queue of the thread group: if so, nothing has to be done, thus returns the value 0 (success):
if (sig<32 && sigismember(&p->signal->shared_pending.signal,sig))
    return 0;
Invokes send_signal( ) to
append the signal to the shared pending signal queue (see the
previous section "The
send_signal( ) Function"). If send_signal( ) returns a nonzero error
code, it terminates while returning the same value.
Invokes the _ _group_complete_signal( ) function to wake up one lightweight process in the thread group (see below).
Releases the p->sighand->siglock spin lock and
enables local interrupts.
Returns 0 (success).
The _ _group_complete_signal(
) function scans the processes in the thread group, looking
for a process that can receive the new signal. A process may be
selected if it satisfies all the following conditions:
The process does not block the signal.
The process is not in state EXIT_ZOMBIE, EXIT_DEAD, TASK_TRACED, or TASK_STOPPED (as an exception, the
process can be in the TASK_TRACED or TASK_STOPPED states if the signal is
SIGKILL).
The process is not being killed—that is, its PF_EXITING flag is not set.
Either the process is currently in execution on a CPU, or
its TIF_SIGPENDING flag is not
already set. (In fact, there is no point in awakening a process
that has pending signals: in general, this operation has been
already performed by the kernel control path that set the TIF_SIGPENDING flag. On the other hand,
if a process is currently in execution, it should be notified of
the new pending signal.)
A thread group might include many processes that satisfy the above conditions. The function selects one of them as follows:
If the process identified by p—the descriptor address passed as
parameter of the group_send_sig_info(
) function—satisfies all the prior rules and can thus
receive the signal, the function selects it.
Otherwise, the function searches for a suitable process by
scanning the members of the thread group, starting from the
process that received the last thread group's signal (p->signal->curr_target).
If _ _group_complete_signal(
) succeeds in finding a suitable process, it sets up the
delivery of the signal to the selected process. First, the function
checks whether the signal is fatal: in this case, the whole thread
group is killed by sending SIGKILL
signals to each lightweight process in the group. Otherwise, if the
signal is not fatal, the function invokes the signal_wake_up( ) function to notify the
selected process that it has a new pending signal (see step 4 in the
earlier section "The
specific_send_sig_info( ) Function").
[*] The actual code is more complicated than the
fragment just shown, because handle_stop_signal( ) also takes
care of the unusual case of the SIGCONT signal being caught, as
well as of the race conditions due to a SIGCONT signal occurring while
all processes in the thread group are being
stopped.
We assume that the kernel noticed the arrival of a signal and invoked one of the functions mentioned in the previous sections to prepare the process descriptor of the process that is supposed to receive the signal. But if that process was not running on a CPU at that moment, the kernel deferred the task of delivering the signal. We now turn to the activities that the kernel performs to ensure that pending signals of a process are handled.
As mentioned in the section "Returning from Interrupts and
Exceptions" in Chapter
4, the kernel checks the value of the TIF_SIGPENDING flag of the process before
allowing the process to resume its execution in User Mode. Thus, the
kernel checks for the existence of pending signals every time it
finishes handling an interrupt or an exception.
To handle the nonblocked pending signals, the kernel invokes the
do_signal( ) function, which receives
two parameters:
regs
The address of the stack area where the User Mode register contents of the current process are saved.
oldset
The address of a variable where the function is supposed to save the bit mask array of blocked signals. It is NULL if there is no need to save the bit mask array.
Our description of the do_signal(
) function will focus on the general mechanism of signal
delivery; the actual code is burdened with lots of details dealing with
race conditions and other special cases—such as freezing the system,
generating core dumps, stopping and killing a whole thread group, and so
on. We will quietly skip all these details.
As already mentioned, the do_signal(
) function is usually only invoked when the CPU is going to
return in User Mode. For that reason, if an interrupt handler invokes
do_signal( ), the function simply
returns:
if ((regs->xcs & 3) != 3)
    return 1;
If the oldset parameter is
NULL, the function initializes it
with the address of the current->blocked field:
if (!oldset)
    oldset = &current->blocked;
The heart of the do_signal( )
function consists of a loop that repeatedly invokes the dequeue_signal( ) function until no nonblocked
pending signals are left in both the private and shared pending signal
queues. The return code of dequeue_signal(
) is stored in the signr
local variable. If its value is 0, it means that all pending signals
have been handled and do_signal( )
can finish. As long as a nonzero value is returned, a pending signal is
waiting to be handled. dequeue_signal(
) is invoked again after do_signal(
) handles the current signal.
The dequeue_signal( ) considers
first all signals in the private pending signal queue, starting from the
lowest-numbered signal, then the signals in the shared queue. It updates
the data structures to indicate that the signal is no longer pending and
returns its number. This task involves clearing the corresponding bit in
current->pending.signal or
current->signal->shared_pending.signal,
and invoking recalc_sigpending( ) to
update the value of the TIF_SIGPENDING flag.
Let's see how the do_signal( )
function handles each pending signal whose number is returned by
dequeue_signal( ). First, it checks
whether the current receiver process
is being monitored by some other process; in this case, do_signal( ) invokes do_notify_parent_cldstop( ) and schedule( ) to make the monitoring process
aware of the signal handling.
Then do_signal( ) loads the
ka local variable with the address of
the k_sigaction data structure of the
signal to be handled:
ka = &current->sig->action[signr-1];
Depending on the contents, three kinds of actions may be performed: ignoring the signal, executing a default action, or executing a signal handler.
When a delivered signal is explicitly ignored, the do_signal( ) function simply continues with a
new execution of the loop and therefore considers another pending
signal:
if (ka->sa.sa_handler == SIG_IGN)
    continue;
In the following two sections we will describe how a default action and a signal handler are executed.
If ka->sa.sa_handler is equal to SIG_DFL, do_signal(
) must perform the default action of the signal. The only
exception comes when the receiving process is
init, in which case the signal is discarded as
described in the earlier section "Actions Performed upon
Delivering a Signal":
if (current->pid == 1)
    continue;
For other processes, the signals whose default action is "ignore" are also easily handled:
if (signr==SIGCONT || signr==SIGCHLD ||
    signr==SIGWINCH || signr==SIGURG)
    continue;
The signals whose default action is "stop" may stop all
processes in the thread group. To do this, do_signal( ) sets their states to TASK_STOPPED and then invokes the schedule( ) function (see the section "The schedule( ) Function"
in Chapter 7):
if (signr==SIGSTOP || signr==SIGTSTP ||
    signr==SIGTTIN || signr==SIGTTOU) {
    if (signr != SIGSTOP &&
        is_orphaned_pgrp(current->signal->pgrp))
        continue;
    do_signal_stop(signr);
}
The difference between SIGSTOP and the other signals is subtle:
SIGSTOP always stops the thread
group, while the other signals stop the thread group only if it is not
in an "orphaned process group." The POSIX standard specifies that a
process group is not orphaned as long as there is
a process in the group that has a parent in a different process group
but in the same session. Thus, if the parent process dies but the user
who started the process is still logged in, the process group is not
orphaned.
The do_signal_stop( )
function checks whether current is
the first process being stopped in the thread group. If so, it
activates a "group stop": essentially, the function sets the group_stop_count field in the signal
descriptor to a positive value, and awakens each process in the thread
group. Each such process, in turn, looks at this field to recognize
that a group stop is in progress, changes its state to TASK_STOPPED, and invokes schedule(). The do_signal_stop( ) function also sends a
SIGCHLD signal to the parent
process of the thread group leader, unless the parent has set the
SA_NOCLDSTOP flag of SIGCHLD.
The signals whose default action is "dump" may create a core file in the process working directory;
this file lists the complete contents of the process's address space
and CPU registers. After do_signal(
) creates the core file, it kills the thread group. The
default action of the remaining 18 signals is "terminate," which
consists of simply killing the thread group. To kill the whole thread
group, the function invokes do_group_exit(
), which executes a clean "group exit" procedure (see the
section "Process
Termination" in Chapter
3).
If a handler has been established for the signal, the
do_signal( ) function must enforce
its execution. It does this by invoking handle_signal( ):
handle_signal(signr, &info, &ka, oldset, regs);
if (ka->sa.sa_flags & SA_ONESHOT)
    ka->sa.sa_handler = SIG_DFL;
return 1;
If the received signal has the SA_ONESHOT flag set, it must be reset to its
default action, so that further occurrences of the same signal will
not trigger again the execution of the signal handler. Notice how
do_signal( ) returns after having
handled a single signal. Other pending signals won't be considered
until the next invocation of do_signal(
). This approach ensures that real-time signals will be
dealt with in the proper order.
Executing a signal handler is a rather complex task because of the need to juggle stacks carefully while switching between User Mode and Kernel Mode. We explain exactly what is entailed here:
Signal handlers are functions defined by User Mode processes and
included in the User Mode code segment. The handle_signal( ) function runs in Kernel
Mode while signal handlers run in User Mode; this means that the
current process must first execute the signal handler in User Mode
before being allowed to resume its "normal" execution. Moreover, when
the kernel attempts to resume the normal execution of the process, the
Kernel Mode stack no longer contains the hardware context of the
interrupted program, because the Kernel Mode stack is emptied at every
transition from User Mode to Kernel Mode.
An additional complication is that signal handlers may invoke system calls. In this case, after the service routine executes, control must be returned to the signal handler instead of to the normal flow of code of the interrupted program.
The solution adopted in Linux consists of copying the hardware
context saved in the Kernel Mode stack onto the User Mode stack of the
current process. The User Mode stack is also modified in such a way
that, when the signal handler terminates, the sigreturn( ) system call is automatically invoked to copy the
hardware context back on the Kernel Mode stack and to restore the
original content of the User Mode stack.
Figure 11-2
illustrates the flow of execution of the functions involved in
catching a signal. A nonblocked signal is sent to a process.
When an interrupt or exception occurs, the process switches into
Kernel Mode. Right before returning to User Mode, the kernel executes
the do_signal( ) function, which in
turn handles the signal (by invoking handle_signal( )) and sets up the User Mode
stack (by invoking setup_frame( )
or setup_rt_frame( )). When the
process switches again to User Mode, it starts executing the signal
handler, because the handler's starting address was forced into the
program counter. When that function terminates, the return code placed
on the User Mode stack by the setup_frame(
) or setup_rt_frame( )
function is executed. This code invokes the sigreturn( ) or the rt_sigreturn(
) system call; the corresponding service routines copy
the hardware context of the normal program to the Kernel Mode stack
and restore the User Mode stack back to its original state (by
invoking restore_sigcontext( )).
When the system call terminates, the normal program can thus resume
its execution.
Let's now examine in detail how this scheme is carried out.
To properly set the User Mode stack of the process,
the handle_signal( ) function
invokes either setup_frame( )
(for signals that do not require a siginfo_t table; see the section "System Calls Related to Signal
Handling" later in this chapter) or setup_rt_frame( ) (for signals that do
require a siginfo_t table). To
choose among these two functions, the kernel checks the value of the
SA_SIGINFO flag in the sa_flags field of the sigaction table associated with the
signal.
The setup_frame( ) function
receives four parameters, which have the following meanings:
sig
Signal number
ka
Address of the k_sigaction table associated with
the signal
oldset
Address of a bit mask array of blocked signals
regs
Address in the Kernel Mode stack area where the User Mode register contents are saved
The setup_frame( ) function
pushes onto the User Mode stack a data structure called a
frame, which contains the information needed to
handle the signal and to ensure the correct return to the sys_sigreturn( ) function. A frame is a
sigframe table that includes the
following fields (see Figure 11-3):
pretcode
Return address of the signal handler function; it points to the code at the _ _kernel_sigreturn label (see below).
sig
The signal number; this is the parameter required by the signal handler.
sc
Structure of type sigcontext containing the hardware context of the User Mode process right before switching to Kernel Mode (this information is copied from the Kernel Mode stack of current). It also contains a bit array that specifies the blocked regular signals of the process.
fpstate
Structure of type _fpstate that may be used to store
the floating point registers of the User Mode process (see the
section "Saving
and Loading the FPU, MMX, and XMM Registers" in Chapter 3).
extramask
Bit array that specifies the blocked real-time signals.
retcode
8-byte code issuing a sigreturn( ) system call. In earlier versions of Linux, this
code was effectively executed to return from the signal
handler; in Linux 2.6, however, it is used only as a
signature, so that debuggers can recognize the signal stack
frame.
The setup_frame( ) function
starts by invoking get_sigframe(
) to compute the first memory location of the frame. That
memory location is usually[*] in the User Mode stack, so the function returns the
value:
(regs->esp - sizeof(struct sigframe)) & 0xfffffff8
Because stacks grow toward lower addresses, the initial address of the frame is obtained by subtracting its size from the address of the current stack top and aligning the result to a multiple of 8.
The returned address is then verified by means of the access_ok macro; if it is valid, the
function repeatedly invokes _ _put_user(
) to fill all the fields of the frame. The pretcode field in the frame is initialized
to &_ _kernel_sigreturn, the
address of some glue code placed in the vsyscall page (see the
section "Issuing a
System Call via the sysenter Instruction" in Chapter 10).
Once this is done, the function modifies the regs area of the Kernel Mode stack, thus
ensuring that control is transferred to the signal handler when
current resumes its execution in
User Mode:
regs->esp = (unsigned long) frame;
regs->eip = (unsigned long) ka->sa.sa_handler;
regs->eax = (unsigned long) sig;
regs->edx = regs->ecx = 0;
regs->xds = regs->xes = regs->xss = _ _USER_DS;
regs->xcs = _ _USER_CS;
The setup_frame( ) function
terminates by resetting the segmentation registers saved on the
Kernel Mode stack to their default value. Now the information needed
by the signal handler is on the top of the User Mode stack.
The setup_rt_frame( )
function is similar to setup_frame(
), but it puts on the User Mode stack an
extended frame (stored in the rt_sigframe data structure) that also
includes the content of the siginfo_t table associated with the
signal. Moreover, this function sets the pretcode field so that it points to the
_ _kernel_rt_sigreturn code in
the vsyscall page.
After setting up the User Mode stack, the handle_signal( ) function checks the
values of the flags associated with the signal. If the signal does
not have the SA_NODEFER flag set,
the signals in the sa_mask field
of the sigaction table must be
blocked during the execution of the signal handler:
if (!(ka->sa.sa_flags & SA_NODEFER)) {
    spin_lock_irq(&current->sighand->siglock);
    sigorsets(&current->blocked, &current->blocked, &ka->sa.sa_mask);
    sigaddset(&current->blocked, sig);
    recalc_sigpending(current);
    spin_unlock_irq(&current->sighand->siglock);
}
As described earlier, the recalc_sigpending( ) function checks
whether the process has nonblocked pending signals and sets its
TIF_SIGPENDING flag
accordingly.
The function then returns to do_signal( ), which also returns immediately.
When do_signal( )
returns, the current process resumes its execution in User Mode.
Because of the preparation by setup_frame(
) described earlier, the eip register points to the first
instruction of the signal handler, while esp points to the first memory location of
the frame that has been pushed on top of the User Mode stack. As a
result, the signal handler is executed.
When the signal handler terminates, the return address on top
of the stack points to the code in the vsyscall page referenced by
the pretcode field of the
frame:
_ _kernel_sigreturn:
    popl %eax
    movl $_ _NR_sigreturn, %eax
    int $0x80
Therefore, the signal number (that is, the sig field of the frame) is discarded from
the stack; the sigreturn( )
system call is then invoked.
The sys_sigreturn( )
function computes the address of the pt_regs data structure regs, which contains the hardware context
of the User Mode process (see the section "Parameter Passing" in
Chapter 10). From the
value stored in the esp field, it
can thus derive and check the frame address inside the User Mode
stack:
frame = (struct sigframe *)(regs.esp - 8);
if (verify_area(VERIFY_READ, frame, sizeof(*frame))) {
    force_sig(SIGSEGV, current);
    return 0;
}
Then the function copies the bit array of signals that were
blocked before invoking the signal handler from the sc field of the frame to the blocked field of current. As a result, all signals that
have been masked for the execution of the signal handler are
unblocked. The recalc_sigpending(
) function is then invoked.
The sys_sigreturn( )
function must at this point copy the process hardware context from
the sc field of the frame to the
Kernel Mode stack and remove the frame from the User Mode stack; it
performs these two tasks by invoking the restore_sigcontext( ) function.
If the signal was sent by a system call such as rt_sigqueueinfo( ) that required a siginfo_t table to be associated with the
signal, the mechanism is similar. The pretcode field of the extended frame
points to the _
_kernel_rt_sigreturn code in the vsyscall page, which in
turn invokes the rt_sigreturn( )
system call; the corresponding sys_rt_sigreturn( ) service routine copies
the process hardware context from the extended frame to the Kernel
Mode stack and restores the original User Mode stack content by
removing the extended frame from it.
The request associated with a system call cannot always
be immediately satisfied by the kernel; when this happens, the process
that issued the system call is put in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state.
If the process is put in a TASK_INTERRUPTIBLE state and some other
process sends a signal to it, the kernel puts it in the TASK_RUNNING state without completing the
system call (see the section "Returning from Interrupts and
Exceptions" in Chapter
4). The signal is delivered to the process while switching back
to User Mode. When this happens, the system call service routine does
not complete its job, but returns an EINTR, ERESTARTNOHAND, ERESTART_RESTARTBLOCK, ERESTARTSYS, or ERESTARTNOINTR error code.
In practice, the only error code a User Mode process can get in
this situation is EINTR, which
means that the system call has not been completed. (The application
programmer may check this code and decide whether to reissue the
system call.) The remaining error codes are used internally by the
kernel to specify whether the system call may be reexecuted
automatically after the signal handler termination.
Table 11-11 lists the error codes related to unfinished system calls and their impact for each of the three possible signal actions. The terms that appear in the entries are defined in the following list:
Terminate
The system call will not be automatically reexecuted; the
process will resume its execution in User Mode at the
instruction following the int
$0x80 or
sysenter one and the eax register will contain the -EINTR value.
Reexecute
The kernel forces the User Mode process to reload the
eax register with the system
call number and to reexecute the int
$0x80 or sysenter
instruction; the process is not aware of the reexecution and the
error code is not passed to it.
Depends
The system call is reexecuted only if the SA_RESTART flag of the delivered
signal is set; otherwise, the system call terminates with a
-EINTR error code.
Table 11-11. Reexecution of system calls
Error codes and their impact on system call execution

| Signal Action | EINTR | ERESTARTSYS | ERESTARTNOHAND, ERESTART_RESTARTBLOCK [a] | ERESTARTNOINTR |
|---|---|---|---|---|
| Default | Terminate | Reexecute | Reexecute | Reexecute |
| Ignore | Terminate | Reexecute | Reexecute | Reexecute |
| Catch | Terminate | Depends | Terminate | Reexecute |

[a] The ERESTARTNOHAND and ERESTART_RESTARTBLOCK error codes differ in the mechanism used to restart the system call.
When delivering a signal, the kernel must be sure that the
process really issued a system call before attempting to reexecute it.
This is where the orig_eax field of
the regs hardware context plays a
critical role. Let's recall how this field is initialized when the
interrupt or exception handler starts:
Interrupt
The field contains the IRQ number associated with the interrupt minus 256 (see the section "Saving the registers for the interrupt handler" in Chapter 4).
0x80 exception (also sysenter)
The field contains the system call number (see the section "Entering and Exiting a System Call" in Chapter 10).
Other exceptions
The field contains the value -1 (see the section "Saving the Registers for the Exception Handler" in Chapter 4).
Therefore, a nonnegative value in the orig_eax field means that the signal has
woken up a TASK_INTERRUPTIBLE
process that was sleeping in a system call. The service routine
recognizes that the system call was interrupted, and thus returns one
of the previously mentioned error codes.
If the signal is explicitly ignored or if its default
action is enforced, do_signal( )
analyzes the error code of the system call to decide whether the
unfinished system call must be automatically reexecuted, as
specified in Table
11-11. If the call must be restarted, the function modifies
the regs hardware context so
that, when the process is back in User Mode, eip points either to the int $0x80 instruction or to the sysenter instruction, and eax contains the system call
number:
if (regs->orig_eax >= 0) {
if (regs->eax == -ERESTARTNOHAND || regs->eax == -ERESTARTSYS ||
regs->eax == -ERESTARTNOINTR) {
regs->eax = regs->orig_eax;
regs->eip -= 2;
}
if (regs->eax == -ERESTART_RESTARTBLOCK) {
regs->eax = __NR_restart_syscall;
regs->eip -= 2;
}
}
The regs->eax field is
filled with the return code of a system call service routine (see
the section "Entering and
Exiting a System Call" in Chapter 10). Notice that both
the int $0x80 and sysenter instructions are two bytes long
so the function subtracts 2 from eip in order to set it to the instruction
that triggers the system call.
The error code ERESTART_RESTARTBLOCK is special, because
the eax register is set to the
number of the restart_syscall( )
system call; thus, the User Mode process does not
restart the same system call that was interrupted by the signal.
This error code is only used by time-related system calls that, when
restarted, should adjust their User Mode parameters. A typical
example is the nanosleep( )
system call (see the section "An Application of Dynamic
Timers: the nanosleep( ) System Call" in Chapter 6): suppose that a
process invokes it to pause the execution for 20 milliseconds, and
that a signal occurs 10 milliseconds later. If the system call would
be restarted as usual, the total delay time would exceed 30
milliseconds.
Instead, the service routine of the nanosleep( ) system call fills the
restart_block field in the
current's thread_info structure with the address of
a special service routine to be used when restarting, and returns
-ERESTART_RESTARTBLOCK if
interrupted. The sys_restart_syscall(
) service routine just executes the special nanosleep( )'s service routine, which
adjusts the delay to consider the time elapsed between the
invocation of the original system call and its restarting.
If the signal is caught, handle_signal( ) analyzes the error code
and, possibly, the SA_RESTART
flag of the sigaction table to
decide whether the unfinished system call must be reexecuted:
if (regs->orig_eax >= 0) {
switch (regs->eax) {
case -ERESTART_RESTARTBLOCK:
case -ERESTARTNOHAND:
regs->eax = -EINTR;
break;
case -ERESTARTSYS:
if (!(ka->sa.sa_flags & SA_RESTART)) {
regs->eax = -EINTR;
break;
}
/* fallthrough */
case -ERESTARTNOINTR:
regs->eax = regs->orig_eax;
regs->eip -= 2;
}
}
If the system call must be restarted, handle_signal( ) proceeds exactly as
do_signal( ); otherwise, it
returns an -EINTR error code to
the User Mode process.
[*] Linux allows processes to specify an alternative stack for
their signal handlers by invoking the sigaltstack( ) system call; this
feature is also required by the X/Open standard. When an
alternative stack is present, the get_sigframe( ) function returns an
address inside that stack. We don't discuss this feature
further, because it is conceptually similar to regular signal
handling.
As stated in the introduction of this chapter, programs
running in User Mode are allowed to send and receive signals. This means
that a set of system calls must be defined to allow these kinds of
operations. Unfortunately, for historical reasons, several system calls
exist that serve essentially the same purpose. As a result, some of
these system calls are never invoked. For instance, sys_sigaction( ) and sys_rt_sigaction( ) are almost identical, so
the sigaction( ) wrapper function included in the C library ends up
invoking sys_rt_sigaction( ) instead
of sys_sigaction( ). We will describe
some of the most significant system calls in the following
sections.
The kill(pid,sig)
system call is commonly used to send signals to conventional processes
or multithreaded applications; its corresponding service routine is
the sys_kill( ) function. The
integer pid parameter has several
meanings, depending on its numerical value:
pid > 0
The sig signal is sent
to the thread group of the process whose PID is equal to
pid.
pid = 0
The sig signal is sent
to all thread groups of the processes in the same process group
as the calling process.
pid = -1
The signal is sent to all processes, except
swapper (PID 0), init
(PID 1), and current.
pid < -1
The signal is sent to all thread groups of the processes in the process group -pid.
The sys_kill( ) function sets
up a minimal siginfo_t table for
the signal, and then invokes kill_something_info( ):
info.si_signo = sig;
info.si_errno = 0;
info.si_code = SI_USER;
info._sifields._kill._pid = current->tgid;
info._sifields._kill._uid = current->uid;
return kill_something_info(sig, &info, pid);
The kill_something_info( )
function, in turn, invokes either kill_proc_info( ) (to send the signal to a
single thread group via group_send_sig_info(
)), or kill_pg_info( )
(to scan all processes in the destination process group and invoke
send_sig_info( ) for each of them),
or repeatedly group_send_sig_info(
) for each process in the system (if pid is -1).
The kill( ) system call is
able to send every signal, even the so-called real-time signals that
have numbers ranging from 32 to 64. However, as we saw in the earlier
section "Generating a
Signal," the kill( ) system
call does not ensure that a new element is added to the pending signal
queue of the destination process, so multiple instances of pending
signals can be lost. Real-time signals should be sent by means of a
system call such as rt_sigqueueinfo(
) (see the later section "System Calls for Real-Time
Signals").
System V and BSD Unix variants also have a killpg( ) system call, which is able to explicitly send a signal
to a group of processes. In Linux, the function is implemented as a
library function that uses the kill(
) system call. Another variation is raise( ) , which sends a signal to the current process (that is,
to the process executing the function). In Linux, raise() is implemented as a library
function.
The tkill( ) and
tgkill( ) system calls send a
signal to a specific process in a thread group. The pthread_kill( ) function of every
POSIX-compliant pthread library invokes either of
them to send a signal to a specific lightweight process.
The tkill( ) system call
expects two parameters: the PID pid
of the process to be signaled and the signal number sig. The sys_tkill(
) service routine fills a siginfo table, gets the process descriptor
address, makes some permission checks (such as those in step 2 in the
section "The
group_send_sig_info( ) Function"), and invokes specific_send_sig_info( ) to send the
signal.
The tgkill( ) system call
differs from tkill( ) because it
has a third parameter: the thread group ID (tgid) of the thread
group that includes the process to be signaled. The sys_tgkill( ) service routine performs
exactly the same operations as sys_tkill(
), but also checks that the process being signaled actually
belongs to the thread group tgid.
This additional check solves a race condition that occurs when a
signal is sent to a process that is being killed: if another
multithreaded application is creating lightweight processes fast
enough, the signal could be delivered to the wrong process. The
tgkill( ) system call solves the
problem, because the thread group ID is never changed during the life
span of a multithreaded application.
The sigaction(sig,act,oact)
system call allows users to specify an action for a signal; of course,
if no signal action is defined, the kernel executes the default action
associated with the delivered signal.
The corresponding sys_sigaction(
) service routine acts on two parameters: the sig signal number and the act table of type old_sigaction that specifies the new action.
A third oact optional output
parameter may be used to get the previous action associated with the
signal. (The old_sigaction data
structure contains the same fields as the sigaction structure described in the earlier
section "Data Structures
Associated with Signals," but in a different order.)
The function checks first whether the act address is valid. Then it fills the
sa_handler, sa_flags, and sa_mask fields of a new_ka local variable of type k_sigaction with the corresponding fields of
*act:
__get_user(new_ka.sa.sa_handler, &act->sa_handler);
__get_user(new_ka.sa.sa_flags, &act->sa_flags);
__get_user(mask, &act->sa_mask);
siginitset(&new_ka.sa.sa_mask, mask);
The function invokes do_sigaction(
) to copy the new new_ka
table into the entry at the sig-1
position of current->sig->action
( the number of the signal is one higher than the position
in the array because there is no zero signal):
k = &current->sig->action[sig-1];
if (act) {
*k = *act;
sigdelsetmask(&k->sa.sa_mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
if (k->sa.sa_handler == SIG_IGN || (k->sa.sa_handler == SIG_DFL &&
(sig==SIGCONT || sig==SIGCHLD || sig==SIGWINCH || sig==SIGURG))) {
rm_from_queue(sigmask(sig), &current->signal->shared_pending);
t = current;
do {
rm_from_queue(sigmask(sig), &current->pending);
recalc_sigpending_tsk(t);
t = next_thread(t);
} while (t != current);
}
}
The POSIX standard requires that setting a signal action to
either SIG_IGN or SIG_DFL when the default action is "ignore"
causes every pending signal of the same type to be discarded.
Moreover, notice that no matter what the requested masked signals are
for the signal handler, SIGKILL and
SIGSTOP are never masked.
The sigaction( ) system call
also allows the user to initialize the sa_flags field in the sigaction table. We listed the values
allowed for this field and the related meanings in Table 11-6 (earlier in
this chapter).
Older System V Unix variants offered the signal( ) system call, which is still widely used by programmers.
Recent C libraries implement signal(
) by means of rt_sigaction(
) . However, Linux still supports older C libraries and
offers the sys_signal( ) service
routine:
new_sa.sa.sa_handler = handler;
new_sa.sa.sa_flags = SA_ONESHOT | SA_NOMASK;
ret = do_sigaction(sig, &new_sa, &old_sa);
return ret ? ret : (unsigned long)old_sa.sa.sa_handler;
The sigpending( )
system call allows a process to examine the set of
pending blocked signals—i.e., those that have been raised while
blocked. The corresponding sys_sigpending(
) service routine acts on a single parameter, set, namely, the address of a user variable
where the array of bits must be copied:
sigorsets(&pending, &current->pending.signal,
          &current->signal->shared_pending.signal);
sigandsets(&pending, &current->blocked, &pending);
copy_to_user(set, &pending, 4);
The sigprocmask( )
system call allows processes to modify the set of
blocked signals; it applies only to regular (non-real-time) signals.
The corresponding sys_sigprocmask(
) service routine acts on three parameters:
oset
Pointer in the process address space to a bit array where the previous bit mask must be stored.
set
Pointer in the process address space to the bit array containing the new bit mask.
how
Flag that may have one of the following values:

SIG_BLOCK
The *set bit mask array specifies the signals that must be added to the bit mask array of blocked signals.

SIG_UNBLOCK
The *set bit mask array specifies the signals that must be removed from the bit mask array of blocked signals.

SIG_SETMASK
The *set bit mask array specifies the new bit mask array of blocked signals.
The function invokes copy_from_user(
) to copy the value pointed to by the set parameter into the new_set local variable and copies the bit
mask array of standard blocked signals of current into the old_set local variable. It then acts as the
how flag specifies on these two
variables:
if (copy_from_user(&new_set, set, sizeof(*set)))
return -EFAULT;
new_set &= ~(sigmask(SIGKILL)|sigmask(SIGSTOP));
old_set = current->blocked.sig[0];
if (how == SIG_BLOCK)
sigaddsetmask(&current->blocked, new_set);
else if (how == SIG_UNBLOCK)
sigdelsetmask(&current->blocked, new_set);
else if (how == SIG_SETMASK)
current->blocked.sig[0] = new_set;
else
return -EINVAL;
recalc_sigpending(current);
if (oset && copy_to_user(oset, &old_set, sizeof(*oset)))
return -EFAULT;
return 0;
The sigsuspend( )
system call puts the process in the TASK_INTERRUPTIBLE state, after having
blocked the standard signals specified by a bit mask array to which
the mask parameter points. The
process will wake up only when a nonignored, nonblocked signal is sent
to it.
The corresponding sys_sigsuspend(
) service routine executes these statements:
mask &= ~(sigmask(SIGKILL) | sigmask(SIGSTOP));
saveset = current->blocked;
siginitset(&current->blocked, mask);
recalc_sigpending(current);
regs->eax = -EINTR;
while (1) {
current->state = TASK_INTERRUPTIBLE;
schedule( );
if (do_signal(regs, &saveset))
return -EINTR;
}
The schedule( ) function
selects another process to run. When the process that issued the
sigsuspend( ) system call is
executed again, sys_sigsuspend( )
invokes the do_signal( ) function
to deliver the signal that has awakened the process. If that function
returns the value 1, the signal is not ignored. Therefore the system
call terminates by returning the error code -EINTR.
The sigsuspend( ) system call
may appear redundant, because the combined execution of sigprocmask( ) and sleep( )
apparently yields the same result. But this is not
true: because processes can be interleaved at any time, one must be
conscious that invoking a system call to perform action A followed by
another system call to perform action B is not equivalent to invoking
a single system call that performs action A and then action B.
In this particular case, sigprocmask(
) might unblock a signal that is delivered before invoking
sleep( ). If this happens, the
process might remain in a TASK_INTERRUPTIBLE state forever, waiting
for the signal that was already delivered. On the other hand, the
sigsuspend( ) system call does not
allow signals to be sent after unblocking and before the schedule( ) invocation, because other
processes cannot grab the CPU during that time interval.
Because the system calls previously examined apply only to standard signals, additional system calls must be introduced to allow User Mode processes to handle real-time signals .
Several system calls for real-time signals (rt_sigaction( ) , rt_sigpending( )
, rt_sigprocmask( )
, and rt_sigsuspend(
) ) are similar to those described earlier and won't be
discussed further. For the same reason, we won't discuss two other
system calls that deal with queues of real-time signals:
rt_sigqueueinfo( )
Sends a real-time signal so that it is added to the shared
pending signal queue of the destination process. Usually invoked
through the sigqueue( )
standard library function.
rt_sigtimedwait( )
Dequeues a blocked pending signal without delivering it
and returns the signal number to the caller; if no blocked
signal is pending, suspends the current process for a fixed
amount of time. Usually invoked through the sigwaitinfo( ) and sigtimedwait(
) standard library functions.
One of Linux's keys to success is its ability to coexist comfortably with other systems. You can transparently mount disks or partitions that host file formats used by Windows , other Unix systems, or even systems with tiny market shares like the Amiga. Linux manages to support multiple filesystem types in the same way other Unix variants do, through a concept called the Virtual Filesystem.
The idea behind the Virtual Filesystem is to put a wide range of information in the kernel to represent many different types of filesystems ; there is a field or function to support each operation provided by all real filesystems supported by Linux. For each read, write, or other function called, the kernel substitutes the actual function that supports a native Linux filesystem, the NTFS filesystem, or whatever other filesystem the file is on.
This chapter discusses the aims, structure, and implementation of Linux's Virtual Filesystem. It focuses on three of the five standard Unix file types—namely, regular files, directories, and symbolic links. Device files are covered in Chapter 13, while pipes are discussed in Chapter 19. To show how a real filesystem works, Chapter 18 covers the Second Extended Filesystem that appears on nearly all Linux systems.
The Virtual Filesystem (also known as Virtual Filesystem Switch or VFS) is a kernel software layer that handles all system calls related to a standard Unix filesystem. Its main strength is providing a common interface to several kinds of filesystems.
For instance, let's assume that a user issues the shell command:
$ cp /floppy/TEST /tmp/test
where /floppy is the mount point of an MS-DOS diskette and /tmp is a normal Second Extended Filesystem (Ext2) directory. The VFS is an abstraction layer between the application program and the filesystem implementations (see Figure 12-1(a)). Therefore, the cp program is not required to know the filesystem types of /floppy/TEST and /tmp/test. Instead, cp interacts with the VFS by means of generic system calls known to anyone who has done Unix programming (see the section "File-Handling System Calls" in Chapter 1); the code executed by cp is shown in Figure 12-1(b).
Filesystems supported by the VFS may be grouped into three main classes:
Disk-based filesystems
These manage memory space available in a local disk or in some other device that emulates a disk (such as a USB flash drive). Some of the well-known disk-based filesystems supported by the VFS are:
Filesystems for Linux such as the widely used Second Extended Filesystem (Ext2), the recent Third Extended Filesystem (Ext3), and the Reiser Filesystems (ReiserFS )[*]
Filesystems for Unix variants such as sysv filesystem (System V , Coherent , Xenix ), UFS (BSD , Solaris , NEXTSTEP ), MINIX filesystem, and VERITAS VxFS (SCO UnixWare )
Microsoft filesystems such as MS-DOS, VFAT (Windows 95 and later releases), and NTFS (Windows NT 4 and later releases)
ISO9660 CD-ROM filesystem (formerly High Sierra Filesystem) and Universal Disk Format (UDF ) DVD filesystem
Other proprietary filesystems such as IBM's OS/2 (HPFS ), Apple's Macintosh (HFS ), Amiga's Fast Filesystem (AFFS ), and Acorn Disk Filing System (ADFS )
Additional journaling filesystems originating in systems other than Linux such as IBM's JFS and SGI's XFS
Network filesystems
These allow easy access to files included in filesystems belonging to other networked computers. Some well-known network filesystems supported by the VFS are NFS , Coda , AFS (Andrew filesystem), CIFS (Common Internet File System, used in Microsoft Windows ), and NCP (Novell's NetWare Core Protocol).
Special filesystems
These do not manage disk space, either locally or remotely. The /proc filesystem is a typical example of a special filesystem (see the later section "Special Filesystems").
In this book, we describe in detail the Ext2 and Ext3 filesystems only (see Chapter 18); the other filesystems are not covered for lack of space.
As mentioned in the section "An Overview of the Unix Filesystem" in Chapter 1, Unix directories build a tree whose root is the / directory. The root directory is contained in the root filesystem, which in Linux, is usually of type Ext2 or Ext3. All other filesystems can be "mounted" on subdirectories of the root filesystem.[*]
A disk-based filesystem is usually stored in a hardware block device such as a hard disk, a floppy, or a CD-ROM. A useful feature of Linux's VFS allows it to handle virtual block devices such as /dev/loop0, which may be used to mount filesystems stored in regular files. As a possible application, a user may protect her own private filesystem by storing an encrypted version of it in a regular file.
The first Virtual Filesystem was included in Sun Microsystems's SunOS in 1986. Since then, most Unix filesystems include a VFS. Linux's VFS, however, supports the widest range of filesystems.
The key idea behind the VFS consists of introducing a common file model capable of representing all supported filesystems. This model strictly mirrors the file model provided by the traditional Unix filesystem. This is not surprising, because Linux wants to run its native filesystem with minimum overhead. However, each specific filesystem implementation must translate its physical organization into the VFS's common file model.
For instance, in the common file model, each directory is regarded as a file, which contains a list of files and other directories. However, several non-Unix disk-based filesystems use a File Allocation Table (FAT), which stores the position of each file in the directory tree. In these filesystems, directories are not files. To stick to the VFS's common file model, the Linux implementations of such FAT-based filesystems must be able to construct on the fly, when needed, the files corresponding to the directories. Such files exist only as objects in kernel memory.
More essentially, the Linux kernel cannot hardcode a particular function to handle an operation such as read() or ioctl(). Instead, it must use a pointer for each operation; the pointer is made to point to the proper function for the particular filesystem being accessed.
Let's illustrate this concept by showing how the read() shown in Figure 12-1 would be translated by the kernel into a call specific to the MS-DOS filesystem. The application's call to read() makes the kernel invoke the corresponding sys_read() service routine, like every other system call. The file is represented by a file data structure in kernel memory, as we'll see later in this chapter. This data structure contains a field called f_op that contains pointers to functions specific to MS-DOS files, including a function that reads a file. sys_read() finds the pointer to this function and invokes it. Thus, the application's read() is turned into the rather indirect call:
file->f_op->read(...);
Similarly, the write() operation triggers the execution of a proper Ext2 write function associated with the output file. In short, the kernel is responsible for assigning the right set of pointers to the file variable associated with each open file, and then for invoking the call specific to each filesystem that the f_op field points to.
One can think of the common file model as object-oriented, where an object is a software construct that defines both a data structure and the methods that operate on it. For reasons of efficiency, Linux is not coded in an object-oriented language such as C++. Objects are therefore implemented as plain C data structures with some fields pointing to functions that correspond to the object's methods.
The common file model consists of the following object types:
The superblock object
Stores information concerning a mounted filesystem. For disk-based filesystems, this object usually corresponds to a filesystem control block stored on disk.
The inode object
Stores general information about a specific file. For disk-based filesystems, this object usually corresponds to a file control block stored on disk. Each inode object is associated with an inode number, which uniquely identifies the file within the filesystem.
The file object
Stores information about the interaction between an open file and a process. This information exists only in kernel memory during the period when a process has the file open.
The dentry object
Stores information about the linking of a directory entry (that is, a particular name of the file) with the corresponding file. Each disk-based filesystem stores this information in its own particular way on disk.
Figure 12-2 illustrates with a simple example how processes interact with files. Three different processes have opened the same file, two of them using the same hard link. In this case, each of the three processes uses its own file object, while only two dentry objects are required, one for each hard link. Both dentry objects refer to the same inode object, which identifies the superblock object and, together with the latter, the common disk file.
Besides providing a common interface to all filesystem implementations, the VFS has another important role related to system performance. The most recently used dentry objects are contained in a disk cache named the dentry cache, which speeds up the translation from a file pathname to the inode of the last pathname component.
Generally speaking, a disk cache is a software mechanism that allows the kernel to keep in RAM some information that is normally stored on a disk, so that further accesses to that data can be quickly satisfied without a slow access to the disk itself.
Notice how a disk cache differs from a hardware cache or a memory cache, neither of which has anything to do with disks or other devices. A hardware cache is a fast static RAM that speeds up requests directed to the slower dynamic RAM (see the section "Hardware Cache" in Chapter 2). A memory cache is a software mechanism introduced to bypass the Kernel Memory Allocator (see the section "The Slab Allocator" in Chapter 8).
Besides the dentry cache and the inode cache, Linux uses other disk caches. The most important one, called the page cache, is described in detail in Chapter 15.
Table 12-1 illustrates the VFS system calls that refer to filesystems, regular files, directories, and symbolic links. A few other system calls handled by the VFS, such as ioperm(), ioctl(), pipe(), and mknod(), refer to device files and pipes. These are discussed in later chapters. A last group of system calls handled by the VFS, such as socket(), connect(), and bind(), refer to sockets and are used to implement networking. Some of the kernel service routines that correspond to the system calls listed in Table 12-1 are discussed either in this chapter or in Chapter 18.
Table 12-1. Some system calls handled by the VFS
We said earlier that the VFS is a layer between application programs and specific filesystems. However, in some cases, a file operation can be performed by the VFS itself, without invoking a lower-level procedure. For instance, when a process closes an open file, the file on disk doesn't usually need to be touched, and hence the VFS simply releases the corresponding file object. Similarly, when the lseek() system call modifies a file pointer, which is an attribute related to the interaction between an opened file and a process, the VFS needs to modify only the corresponding file object without accessing the file on disk, and therefore it does not have to invoke a specific filesystem procedure. In some sense, the VFS could be considered a "generic" filesystem that relies, when necessary, on specific ones.
[*] Although these filesystems owe their birth to Linux, they have been ported to several other operating systems.
[*] When a filesystem is mounted on a directory, the contents of the directory in the parent filesystem are no longer accessible, because every pathname, including the mount point, will refer to the mounted filesystem. However, the original directory's content shows up again when the filesystem is unmounted. This somewhat surprising feature of Unix filesystems is used by system administrators to hide files; they simply mount a filesystem on the directory containing the files to be hidden.
Each VFS object is stored in a suitable data structure, which includes both the object attributes and a pointer to a table of object methods. The kernel may dynamically modify the methods of the object and, hence, it may install specialized behavior for the object. The following sections explain the VFS objects and their interrelationships in detail.
A superblock object consists of a super_block structure whose fields are described in Table 12-2.
Table 12-2. The fields of the superblock object
| Type | Field | Description |
|---|---|---|
| | | Pointers for superblock list |
| | | Device identifier |
| | | Block size in bytes |
| | | Block size in bytes as reported by the underlying block device driver |
| | | Block size in number of bits |
| | | Modified (dirty) flag |
| | | Maximum size of the files |
| | | Filesystem type |
| | | Superblock methods |
| | | Disk quota handling methods |
| struct quotactl_ops * | s_qcop | Disk quota administration methods |
| struct export_operations * | s_export_op | Export operations used by network filesystems |
| | | Mount flags |
| | | Filesystem magic number |
| | | Dentry object of the filesystem's root directory |
| | | Semaphore used for unmounting |
| | | Superblock semaphore |
| | | Reference counter |
| int | s_syncing | Flag indicating that inodes of the superblock are being synchronized |
| int | s_need_sync_fs | Flag used when synchronizing the superblock's mounted filesystem |
| | | Secondary reference counter |
| void * | s_security | Pointer to superblock security structure |
| struct xattr_handler ** | s_xattr | Pointer to superblock extended attribute structure |
| struct list_head | s_inodes | List of all inodes |
| | | List of modified inodes |
| | | List of inodes waiting to be written to disk |
| | | List of anonymous dentries for handling remote network filesystems |
| | | List of file objects |
| | | Pointer to the block device driver descriptor |
| | | Pointers for a list of superblock objects of a given filesystem type (see the later section "Filesystem Type Registration") |
| | | Descriptor for disk quota |
| int | s_frozen | Flag used when freezing the filesystem (forcing it to a consistent state) |
| wait_queue_head_t | s_wait_unfrozen | Wait queue where processes sleep until the filesystem is unfrozen |
| char[] | s_id | Name of the block device containing the superblock |
| void * | s_fs_info | Pointer to superblock information of a specific filesystem |
| | | Semaphore used by VFS when renaming files across directories |
| u32 | s_time_gran | Timestamp's granularity (in nanoseconds) |
All superblock objects are linked in a circular doubly linked list. The first element of this list is represented by the super_blocks variable, while the s_list field of the superblock object stores the pointers to the adjacent elements in the list. The sb_lock spin lock protects the list against concurrent accesses in multiprocessor systems.
The s_fs_info field points to superblock information that belongs to a specific filesystem; for instance, as we'll see later in Chapter 18, if the superblock object refers to an Ext2 filesystem, the field points to an ext2_sb_info structure, which includes the disk allocation bit masks and other data of no concern to the VFS common file model.
In general, data pointed to by the s_fs_info field is information from the disk duplicated in memory for reasons of efficiency. Each disk-based filesystem needs to access and update its allocation bitmaps in order to allocate or release disk blocks. The VFS allows these filesystems to act directly on the s_fs_info field of the superblock in memory without accessing the disk.
This approach leads to a new problem, however: the VFS superblock might end up no longer synchronized with the corresponding superblock on disk. It is thus necessary to introduce an s_dirt flag, which specifies whether the superblock is dirty, that is, whether the data on the disk must be updated. The lack of synchronization leads to the familiar problem of a corrupted filesystem when a site's power goes down without giving the user the chance to shut down a system cleanly. As we'll see in the section "Writing Dirty Pages to Disk" in Chapter 15, Linux minimizes this problem by periodically copying all dirty superblocks to disk.
The methods associated with a superblock are called superblock operations. They are described by the super_operations structure whose address is included in the s_op field.
Each specific filesystem can define its own superblock operations. When the VFS needs to invoke one of them, say read_inode(), it executes the following:
sb->s_op->read_inode(inode);
where sb stores the address of the superblock object involved. The read_inode field of the super_operations table contains the address of the suitable function, which is therefore directly invoked.
Let's briefly describe the superblock operations, which implement higher-level operations like deleting files or mounting disks. They are listed in the order they appear in the super_operations table:
alloc_inode(sb)
Allocates space for an inode object, including the space required for filesystem-specific data.

destroy_inode(inode)
Destroys an inode object, including the filesystem-specific data.

read_inode(inode)
Fills the fields of the inode object passed as the parameter with the data on disk; the i_ino field of the inode object identifies the specific filesystem inode on the disk to be read.

dirty_inode(inode)
Invoked when the inode is marked as modified (dirty). Used by filesystems such as ReiserFS and Ext3 to update the filesystem journal on disk.

write_inode(inode, flag)
Updates a filesystem inode with the contents of the inode object passed as the parameter; the i_ino field of the inode object identifies the filesystem inode on disk that is concerned. The flag parameter indicates whether the I/O operation should be synchronous.

put_inode(inode)
Invoked when the inode is released (its reference counter is decreased) to perform filesystem-specific operations.

drop_inode(inode)
Invoked when the inode is about to be destroyed, that is, when the last user releases the inode; filesystems that implement this method usually make use of generic_drop_inode(). This function removes every reference to the inode from the VFS data structures and, if the inode no longer appears in any directory, invokes the delete_inode superblock method to delete the inode from the filesystem.

delete_inode(inode)
Invoked when the inode must be destroyed. Deletes the VFS inode in memory and the file data and metadata on disk.

put_super(super)
Releases the superblock object passed as the parameter (because the corresponding filesystem is unmounted).

write_super(super)
Updates a filesystem superblock with the contents of the object indicated.

sync_fs(sb, wait)
Invoked when flushing the filesystem to update filesystem-specific data structures on disk (used by journaling filesystems).

write_super_lockfs(super)
Blocks changes to the filesystem and updates the superblock with the contents of the object indicated. This method is invoked when the filesystem is frozen, for instance by the Logical Volume Manager (LVM) driver.

unlockfs(super)
Undoes the block of filesystem updates achieved by the write_super_lockfs superblock method.

statfs(super, buf)
Returns statistics on a filesystem by filling the buf buffer.

remount_fs(super, flags, data)
Remounts the filesystem with new options (invoked when a mount option must be changed).

clear_inode(inode)
Invoked when a disk inode is being destroyed to perform filesystem-specific operations.

umount_begin(super)
Aborts a mount operation because the corresponding unmount operation has been started (used only by network filesystems).

show_options(seq_file, vfsmount)
Used to display the filesystem-specific options.

quota_read(super, type, data, size, offset)
Used by the quota system to read data from the file that specifies the limits for this filesystem.[*]

quota_write(super, type, data, size, offset)
Used by the quota system to write data into the file that specifies the limits for this filesystem.
The preceding methods are available to all possible filesystem types. However, only a subset of them applies to each specific filesystem; the fields corresponding to unimplemented methods are set to NULL. Notice that no get_super method to read a superblock is defined; how could the kernel invoke a method of an object yet to be read from disk? We'll find an equivalent get_sb method in another object describing the filesystem type (see the later section "Filesystem Type Registration").
All information needed by the filesystem to handle a file is included in a data structure called an inode. A filename is a casually assigned label that can be changed, but the inode is unique to the file and remains the same as long as the file exists. An inode object in memory consists of an inode structure whose fields are described in Table 12-3.
Table 12-3. The fields of the inode object
| Type | Field | Description |
|---|---|---|
| | | Pointers for the hash list |
| | | Pointers for the list that describes the inode's current state |
| | i_sb_list | Pointers for the list of inodes of the superblock |
| | | The head of the list of dentry objects referencing this inode |
| | | inode number |
| | | Usage counter |
| | | File type and access rights |
| | | Number of hard links |
| | | Owner identifier |
| | | Group identifier |
| | | Real device identifier |
| | | File length in bytes |
| | | Time of last file access |
| | | Time of last file write |
| | | Time of last inode change |
| | | Block size in number of bits |
| | | Block size in bytes |
| | | Version number, automatically increased after each use |
| | | Number of blocks of the file |
| unsigned short | i_bytes | Number of bytes in the last block of the file |
| unsigned char | i_sock | Nonzero if file is a socket |
| spinlock_t | i_lock | Spin lock protecting some fields of the inode |
| | | inode semaphore |
| struct rw_semaphore | i_alloc_sem | Read/write semaphore protecting against race conditions in direct I/O file operations |
| | | inode operations |
| | | Default file operations |
| | | Pointer to superblock object |
| | | Pointer to file lock list |
| | | Pointer to an |
| | | |
| | | inode disk quotas |
| | | Pointers for a list of inodes relative to a specific character or block device (see Chapter 13) |
| | | Used if the file is a pipe (see Chapter 19) |
| | | Pointer to the block device driver |
| | | Pointer to the character device driver |
| int | i_cindex | Index of the device file within a group of minor numbers |
| | | inode version number (used by some filesystems) |
| | | Bit mask of directory notify events |
| | | Used for directory notifications |
| | | inode state flags |
| unsigned long | dirtied_when | Dirtying time (in ticks) of the inode |
| | | Filesystem mount flags |
| | | Usage counter for writing processes |
| | | Pointer to inode's security structure |
| | | Pointer to private data |
| seqcount_t | i_size_seqcount | Sequence counter used in SMP systems to get consistent values for |
Each inode object duplicates some of the data included in the disk inode, for instance, the number of blocks allocated to the file. When the value of the i_state field is equal to I_DIRTY_SYNC, I_DIRTY_DATASYNC, or I_DIRTY_PAGES, the inode is dirty, that is, the corresponding disk inode must be updated. The I_DIRTY macro can be used to check the value of these three flags at once (see later for details). Other values of the i_state field are I_LOCK (the inode object is involved in an I/O transfer), I_FREEING (the inode object is being freed), I_CLEAR (the inode object contents are no longer meaningful), and I_NEW (the inode object has been allocated but not yet filled with data read from the disk inode).
Each inode object always appears in one of the following circular doubly linked lists (in all cases, the pointers to the adjacent elements are stored in the i_list field):
The list of valid unused inodes, typically those mirroring valid disk inodes and not currently used by any process. These inodes are not dirty and their i_count field is set to 0. The first and last elements of this list are referenced by the next and prev fields, respectively, of the inode_unused variable. This list acts as a disk cache.
The list of in-use inodes, that is, those mirroring valid disk inodes and used by some process. These inodes are not dirty and their i_count field is positive. The first and last elements are referenced by the inode_in_use variable.
The list of dirty inodes. The first and last elements are referenced by the s_dirty field of the corresponding superblock object.
Each of the lists just mentioned links the i_list fields of the proper inode objects.
Moreover, each inode object is also included in a per-filesystem doubly linked circular list headed at the s_inodes field of the superblock object; the i_sb_list field of the inode object stores the pointers for the adjacent elements in this list.
Finally, the inode objects are also included in a hash table named inode_hashtable. The hash table speeds up the search of the inode object when the kernel knows both the inode number and the address of the superblock object corresponding to the filesystem that includes the file. Because hashing may induce collisions, the inode object includes an i_hash field that contains a backward and a forward pointer to other inodes that hash to the same position; this field creates a doubly linked list of those inodes.
The methods associated with an inode object are also called inode operations. They are described by an inode_operations structure, whose address is included in the i_op field. Here are the inode operations in the order they appear in the inode_operations table:
create(dir, dentry, mode,
nameidata)create(dir, dentry, mode,
nameidata)为与某个目录中的 dentry 对象关联的常规文件创建一个新的磁盘 inode。
Creates a new disk inode for a regular file associated with a dentry object in some directory.
lookup(dir, dentry,
nameidata)lookup(dir, dentry,
nameidata)在目录中搜索与 dentry 对象中包含的文件名相对应的 inode。
Searches a directory for an inode corresponding to the filename included in a dentry object.
link(old_dentry, dir,
new_dentry)link(old_dentry, dir,
new_dentry)创建一个新的硬链接,引用old_dentry目录中指定的文件dir;新的硬链接具有由 指定的名称new_dentry。
Creates a new hard link that refers to the file specified
by old_dentry in the
directory dir; the new hard
link has the name specified by new_dentry.
unlink(dir,
dentry)unlink(dir,
dentry)从目录中删除由 dentry 对象指定的文件的硬链接。
Removes the hard link of the file specified by a dentry object from a directory.
symlink(dir, dentry,
symname)symlink(dir, dentry,
symname)为与某个目录中的 dentry 对象关联的符号链接创建一个新的 inode。
Creates a new inode for a symbolic link associated with a dentry object in some directory.
mkdir(dir, dentry,
mode)mkdir(dir, dentry,
mode)为与某个目录中的 dentry 对象关联的目录创建一个新的 inode。
Creates a new inode for a directory associated with a dentry object in some directory.
rmdir(dir, dentry)rmdir(dir, dentry)从目录中删除其名称包含在 dentry 对象中的子目录。
Removes from a directory the subdirectory whose name is included in a dentry object.
mknod(dir, dentry, mode,
rdev)mknod(dir, dentry, mode,
rdev)为与某个目录中的 dentry 对象关联的特殊文件创建一个新的磁盘 inode。和参数分别指定文件类型以及设备的主设备号和次设备号mode。rdev
Creates a new disk inode for a special file associated
with a dentry object in some directory. The mode and rdev parameters specify, respectively,
the file type and the device's major and minor numbers.
rename(old_dir, old_dentry,
new_dir, new_dentry)rename(old_dir, old_dentry,
new_dir, new_dentry)将由 标识的文件old_entry从old_dir目录移动到该目录new_dir。新文件名包含在new_dentry指向的 dentry 对象中。
Moves the file identified by old_entry from the old_dir directory to the new_dir one. The new filename is
included in the dentry object that new_dentry points to.
readlink(dentry, buffer,
buflen)readlink(dentry, buffer,
buflen)buffer复制到由与 dentry 指定的符号链接相对应的文件路径名指定的用户模式内存区域中。
Copies into a User Mode memory area specified by buffer the file pathname corresponding
to the symbolic link specified by the dentry.
follow_link(inode,
nameidata)follow_link(inode,
nameidata)翻译由 inode 对象指定的符号链接;如果符号链接是相对路径名,则查找操作从第二个参数指定的目录开始。
Translates a symbolic link specified by an inode object; if the symbolic link is a relative pathname, the lookup operation starts from the directory specified in the second parameter.
put_link(dentry,
nameidata)put_link(dentry,
nameidata)释放由该方法分配的所有临时数据结构
follow_link以转换符号链接。
Releases all temporary data structures allocated by the
follow_link method to
translate a symbolic link.
truncate(inode)truncate(inode)修改与 inode 关联的文件的大小。在调用该方法之前,需要将i_sizeinode对象的字段设置为所需的新大小。
Modifies the size of the file associated with an inode.
Before invoking this method, it is necessary to set the i_size field of the inode object to
the required new size.
permission(inode, mask,
nameidata)permission(inode, mask,
nameidata)检查与 关联的文件是否允许指定的访问模式inode。
Checks whether the specified access mode is allowed for
the file associated with inode.
setattr(dentry,
iattr)setattr(dentry,
iattr)触摸 inode 属性后通知“更改事件”。
Notifies a "change event" after touching the inode attributes.
getattr(mnt, dentry, kstat)
Used by some filesystems to read inode attributes.
setxattr(dentry, name, value, size, flags)
Sets an "extended attribute" of an inode (extended attributes are stored on disk blocks outside of any inode).
getxattr(dentry, name, buffer, size)
Gets an extended attribute of an inode.
listxattr(dentry, buffer, size)
Gets the whole list of extended attribute names.
removexattr(dentry, name)
Removes an extended attribute of an inode.
The methods just listed are available to all possible inodes and
filesystem types. However, only a subset of them applies to a specific
inode and filesystem; the fields corresponding to unimplemented
methods are set to NULL.
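The NULL-field convention can be sketched in ordinary user-space C. The names below (mini_inode_operations, vfs_like_create, and so on) are invented for illustration; only the pattern of a function-pointer table with unimplemented entries left NULL mirrors the VFS:

```c
#include <stddef.h>

/* Hypothetical miniature of an operations table in which
 * unimplemented methods are left NULL, as the VFS does. */
struct mini_inode_operations {
    int (*create)(const char *name);
    int (*readlink)(const char *name, char *buf, size_t len);
};

static int myfs_create(const char *name)
{
    (void)name;
    return 0;                   /* pretend the inode was created */
}

/* This filesystem supports create but not symbolic links, so
 * readlink stays NULL, just as the VFS leaves the fields of
 * unimplemented inode methods set to NULL. */
static const struct mini_inode_operations myfs_iops = {
    .create = myfs_create,
};

/* The caller checks for NULL before dispatching. */
static int vfs_like_create(const struct mini_inode_operations *i_op,
                           const char *name)
{
    if (!i_op->create)
        return -1;              /* method not implemented */
    return i_op->create(name);
}
```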
A file object describes how a process interacts with a
file it has opened. The object is created when the file is opened and
consists of a file structure, whose
fields are described in Table 12-4. Notice that
file objects have no corresponding image on disk, and hence no
"dirty" field is included in the file structure to specify that the file
object has been modified.
Table 12-4. The fields of the file object
| Type | Field | Description |
|---|---|---|
| struct list_head | f_list | Pointers for generic file object list |
| struct dentry * | f_dentry | dentry object associated with the file |
| struct vfsmount * | f_vfsmnt | Mounted filesystem containing the file |
| struct file_operations * | f_op | Pointer to file operation table |
| atomic_t | f_count | File object's reference counter |
| unsigned int | f_flags | Flags specified when opening the file |
| mode_t | f_mode | Process access mode |
| int | f_error | Error code for network write operation |
| loff_t | f_pos | Current file offset (file pointer) |
| struct fown_struct | f_owner | Data for I/O event notification via signals |
| unsigned int | f_uid | User's UID |
| unsigned int | f_gid | User group ID |
| struct file_ra_state | f_ra | File read-ahead state (see Chapter 16) |
| size_t | f_maxcount | Maximum number of bytes that can be read or written with a single operation (currently set to 2^31-1) |
| unsigned long | f_version | Version number, automatically increased after each use |
| void * | f_security | Pointer to file object's security structure |
| void * | private_data | Pointer to data specific for a filesystem or a device driver |
| struct list_head | f_ep_links | Head of the list of event poll waiters for this file |
| spinlock_t | f_ep_lock | Spin lock protecting the f_ep_links list |
| struct address_space * | f_mapping | Pointer to file's address space object (see Chapter 15) |
The main information stored in a file object is the file pointer—the current position in the file from which the next operation will take place. Because several processes may access the same file concurrently, the file pointer must be kept in the file object rather than the inode object.
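The role of the file pointer can be observed from user space: dup( ) duplicates a descriptor but not the file object, so the offset is shared, while a second open( ) creates a new file object with its own offset. A minimal POSIX sketch (the temporary path is illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a dup( )'ed descriptor shares the file pointer with the
 * original while a freshly open( )'ed descriptor does not. */
int dup_shares_offset(void)
{
    const char *path = "/tmp/utlk_offset_demo";
    int fd1 = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd1 < 0)
        return -1;
    if (write(fd1, "hello", 5) != 5) {      /* offset is now 5 */
        close(fd1);
        return -1;
    }

    int fd2 = dup(fd1);                     /* same file object as fd1 */
    lseek(fd1, 2, SEEK_SET);                /* move the shared offset... */
    off_t shared = lseek(fd2, 0, SEEK_CUR); /* ...fd2 sees 2 as well */

    int fd3 = open(path, O_RDONLY);         /* a distinct file object... */
    off_t fresh = lseek(fd3, 0, SEEK_CUR);  /* ...with its own offset, 0 */

    close(fd1); close(fd2); close(fd3);
    unlink(path);
    return shared == 2 && fresh == 0;
}
```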
File objects are allocated through a slab cache named
filp, whose descriptor address is stored in the
filp_cachep variable. Because there
is a limit on the number of file objects that can be allocated, the
files_stat variable specifies in
the max_files field the maximum
number of allocatable file objects—i.e., the maximum number of files
that can be accessed at the same time in the system.[*]
"In use" file objects are collected in several lists rooted at
the superblocks of the owning filesystems. Each superblock object
stores in the s_files field the
head of a list of file objects; thus, file objects of files belonging
to different filesystems are included in different lists. The pointers
to the previous and next element in the list are stored in the
f_list field of the file object.
The files_lock spin lock protects
the superblock s_files lists
against concurrent accesses in multiprocessor systems.
The f_count field of the file
object is a reference counter: it counts the number of processes that
are using the file object (remember however that lightweight processes
created with the CLONE_FILES flag
share the table that identifies the open files, thus they use the same
file objects). The counter is also increased when the file object is
used by the kernel itself—for instance, when the object is inserted in
a list, or when a dup( ) system call has been issued.
When the VFS must open a file on behalf of a process, it invokes
the get_empty_filp( ) function to
allocate a new file object. The function invokes kmem_cache_alloc( ) to get a free file
object from the filp cache, then it initializes
the fields of the object as follows:
memset(f, 0, sizeof(*f));
INIT_LIST_HEAD(&f->f_ep_links);
spin_lock_init(&f->f_ep_lock);
atomic_set(&f->f_count, 1);
f->f_uid = current->fsuid;
f->f_gid = current->fsgid;
f->f_owner.lock = RW_LOCK_UNLOCKED;
INIT_LIST_HEAD(&f->f_list);
f->f_maxcount = INT_MAX;
As we explained earlier in the section "The Common File Model,"
each filesystem includes its own set of file
operations that perform such activities as reading and writing a
file. When the kernel loads an inode into memory from disk, it stores
a pointer to these file operations in a file_operations structure whose address is
contained in the i_fop field of the
inode object. When a process opens the file, the VFS initializes the
f_op field of the new file object
with the address stored in the inode so that further calls to file
operations can use these functions. If necessary, the VFS may later
modify the set of file operations by storing a new value in f_op.
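The copying of i_fop into f_op at open time can be sketched with a hypothetical miniature of the structures involved; the names here are invented for illustration and are not the kernel's:

```c
#include <stddef.h>

struct mini_file_operations {
    long (*read)(void *buf, long count);
};

struct mini_inode {
    const struct mini_file_operations *i_fop;  /* set at inode load time */
};

struct mini_file {
    const struct mini_file_operations *f_op;   /* set at open time */
    long f_pos;
};

static long fixed_read(void *buf, long count)
{
    (void)buf;
    return count;               /* pretend count bytes were read */
}

static const struct mini_file_operations myfs_fops = { .read = fixed_read };

/* "Opening" the file: f_op is initialized from the inode's i_fop,
 * so later operations dispatch through the file object. */
static void mini_open(struct mini_file *filp, const struct mini_inode *inode)
{
    filp->f_op = inode->i_fop;
    filp->f_pos = 0;
}

static int open_dispatch_demo(void)
{
    struct mini_inode ino = { .i_fop = &myfs_fops };
    struct mini_file f;
    mini_open(&f, &ino);
    return f.f_op == &myfs_fops && f.f_op->read(NULL, 7) == 7;
}
```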
The following list describes the file operations in the order in
which they appear in the file_operations table:
llseek(file, offset, origin)
Updates the file pointer.
read(file, buf, count, offset)
Reads count bytes from
a file starting at position *offset; the value *offset (which usually corresponds to
the file pointer) is then increased.
aio_read(req, buf, len, pos)
Starts an asynchronous I/O operation to read len bytes into buf from file position pos (introduced to support the io_submit( ) system call).
write(file, buf, count, offset)
Writes count bytes into
a file starting at position *offset; the value *offset (which usually corresponds to
the file pointer) is then increased.
aio_write(req, buf, len, pos)
Starts an asynchronous I/O operation to write len bytes from buf to file position pos.
readdir(dir, dirent, filldir)
Returns the next directory entry of a directory in
dirent; the filldir parameter contains the address
of an auxiliary function that extracts the fields in a directory
entry.
poll(file, poll_table)
Checks whether there is activity on a file and goes to sleep until something happens on it.
ioctl(inode, file, cmd, arg)
Sends a command to an underlying hardware device. This method applies only to device files.
unlocked_ioctl(file, cmd, arg)
Similar to the ioctl
method, but it does not take the big kernel lock (see the section "The Big Kernel
Lock" in Chapter
5). It is expected that all device drivers and all
filesystems will implement this new method instead of the
ioctl method.
compat_ioctl(file, cmd, arg)
Method used to implement the ioctl() 32-bit system call by 64-bit
kernels.
mmap(file, vma)
Performs a memory mapping of the file into a process address space (see the section "Memory Mapping" in Chapter 16).
open(inode, file)
Opens a file by creating a new file object and linking it to the corresponding inode object (see the section "The open( ) System Call" later in this chapter).
flush(file)
Called when a reference to an open file is closed. The actual purpose of this method is filesystem-dependent.
release(inode, file)
Releases the file object. Called when the last reference
to an open file is closed—that is, when the f_count field of the file object
becomes 0.
fsync(file, dentry, flag)
Flushes the file by writing all cached data to disk.
aio_fsync(req, flag)
Starts an asynchronous I/O flush operation.
fasync(fd, file, on)
Enables or disables I/O event notification by means of signals.
lock(file, cmd, file_lock)
Applies a lock to the file (see the section "File Locking" later in this chapter).
readv(file, vector, count, offset)
Reads bytes from a file and puts the results in the
buffers described by vector;
the number of buffers is specified by count.
writev(file, vector, count, offset)
Writes bytes into a file from the buffers described by
vector; the number of buffers
is specified by count.
sendfile(in_file, offset, count, file_send_actor, out_file)
Transfers data from in_file to out_file (introduced to support the sendfile( ) system call).
sendpage(file, page, offset, size, pointer, fill)
Transfers data from file to the page cache's page; this is a low-level method used
by sendfile( ) and by the
networking code for sockets.
get_unmapped_area(file, addr, len, offset, flags)
Gets an unused address range to map the file.
check_flags(flags)
Method invoked by the service routine of the fcntl( ) system call to perform additional checks when
setting the status flags of a file (F_SETFL command). Currently used only
by the NFS network filesystem.
dir_notify(file, arg)
Method invoked by the service routine of the fcntl( ) system call when establishing
a directory change notification (F_NOTIFY command). Currently used only
by the Common Internet File System (CIFS ) network filesystem.
flock(file, flag, lock)
Used to customize the behavior of the flock() system call. No official Linux
filesystem makes use of this method.
The methods just described are available to all possible file
types. However, only a subset of them apply to a specific file type;
the fields corresponding to unimplemented methods are set to NULL.
We mentioned in the section "The Common File Model"
that the VFS considers each directory a file that contains a list of
files and other directories. We will discuss in Chapter 18 how directories are
implemented on a specific filesystem. Once a directory entry is read
into memory, however, it is transformed by the VFS into a dentry
object based on the dentry
structure, whose fields are described in Table 12-5. The kernel
creates a dentry object for every component of a pathname that a
process looks up; the dentry object associates the component to its
corresponding inode. For example, when looking up the /tmp/test pathname, the kernel creates a
dentry object for the / root
directory, a second dentry object for the tmp entry of the root directory, and a
third dentry object for the test
entry of the /tmp
directory.
Notice that dentry objects have no corresponding image on disk, and hence no field
is included in the dentry structure
to specify that the object has been modified. Dentry objects are
stored in a slab allocator cache whose descriptor is dentry_cache; dentry objects are thus
created and destroyed by invoking kmem_cache_alloc( ) and kmem_cache_free( ).
Table 12-5. The fields of the dentry object
| Type | Field | Description |
|---|---|---|
| atomic_t | d_count | Dentry object usage counter |
| unsigned int | d_flags | Dentry cache flags |
| spinlock_t | d_lock | Spin lock protecting the dentry object |
| struct inode * | d_inode | Inode associated with filename |
| struct dentry * | d_parent | Dentry object of parent directory |
| struct qstr | d_name | Filename |
| struct list_head | d_lru | Pointers for the list of unused dentries |
| struct list_head | d_child | For directories, pointers for the list of directory dentries in the same parent directory |
| struct list_head | d_subdirs | For directories, head of the list of subdirectory dentries |
| struct list_head | d_alias | Pointers for the list of dentries associated with the same inode (alias) |
| unsigned long | d_time | Used by the d_revalidate method |
| struct dentry_operations * | d_op | Dentry methods |
| struct super_block * | d_sb | Superblock object of the file |
| void * | d_fsdata | Filesystem-dependent data |
| struct rcu_head | d_rcu | The RCU descriptor used when reclaiming the dentry object (see the section "Read-Copy Update (RCU)" in Chapter 5) |
| struct dcookie_struct * | d_cookie | Pointer to structure used by kernel profilers |
| struct hlist_node | d_hash | Pointer for list in hash table entry |
| int | d_mounted | For directories, counter for the number of filesystems mounted on this dentry |
| unsigned char [] | d_iname | Space for short filename |
Each dentry object may be in one of four states:
Free: The dentry object contains no valid information and is not used by the VFS. The corresponding memory area is handled by the slab allocator.
Unused: The dentry object is not currently used by the kernel. The
d_count usage counter of the
object is 0, but the d_inode
field still points to the associated inode. The dentry object
contains valid information, but its contents may be discarded if
necessary in order to reclaim memory.
In use: The dentry object is currently used by the kernel. The
d_count usage counter is
positive, and the d_inode
field points to the associated inode object. The dentry object
contains valid information and cannot be discarded.
Negative: The inode associated with the dentry does not exist,
either because the corresponding disk inode has been deleted or
because the dentry object was created by resolving a pathname of
a nonexistent file. The d_inode field of the dentry object is
set to NULL, but the object
still remains in the dentry cache, so that further lookup
operations to the same file pathname can be quickly resolved.
The term "negative" is somewhat misleading, because no negative
value is involved.
The methods associated with a dentry object are called
dentry operations ; they are described by the dentry_operations structure, whose address
is stored in the d_op field.
Although some filesystems define their own dentry methods, the fields
are usually NULL and the VFS
replaces them with default functions. Here are the methods, in the
order they appear in the dentry_operations table:
d_revalidate(dentry, nameidata)
Determines whether the dentry object is still valid before using it for translating a file pathname. The default VFS function does nothing, although network filesystems may specify their own functions.
d_hash(dentry, name)
Creates a hash value; this function is a
filesystem-specific hash function for the dentry hash table. The
dentry parameter identifies
the directory containing the component. The name parameter points to a structure
containing both the pathname component to be looked up and the
value produced by the hash function.
d_compare(dir, name1, name2)
Compares two filenames ; name1 should
belong to the directory referenced by dir. The default VFS function is a
normal string match. However, each filesystem can implement this
method in its own way. For instance, MS-DOS does not distinguish capital from lowercase
letters.
d_delete(dentry)
Called when the last reference to a dentry object is
deleted (d_count becomes 0).
The default VFS function does nothing.
d_release(dentry)
Called when a dentry object is going to be freed (released to the slab allocator). The default VFS function does nothing.
d_iput(dentry, ino)
Called when a dentry object becomes "negative"—that is, it
loses its inode. The default VFS function invokes iput( ) to release the inode
object.
Because reading a directory entry from disk and constructing the corresponding dentry object requires considerable time, it makes sense to keep in memory dentry objects that you've finished with but might need later. For instance, people often edit a file and then compile it, or edit and print it, or copy it and then edit the copy. In such cases, the same file needs to be repeatedly accessed.
To maximize efficiency in handling dentries, Linux uses a dentry cache, which consists of two kinds of data structures:
A set of dentry objects in the in-use, unused, or negative state.
A hash table to derive the dentry object associated with a given filename and a given directory quickly. As usual, if the required object is not included in the dentry cache, the search function returns a null value.
The dentry cache also acts as a controller for an inode cache . The inodes in kernel memory that are associated with unused dentries are not discarded, because the dentry cache is still using them. Thus, the inode objects are kept in RAM and can be quickly referenced by means of the corresponding dentries.
All the "unused" dentries are included in a doubly linked "Least
Recently Used" list sorted by time of insertion. In other words, the
dentry object that was last released is put in front of the list, so
the least recently used dentry objects are always near the end of the
list. When the dentry cache has to shrink, the kernel removes elements
from the tail of this list so that the most recently used objects are
preserved. The addresses of the first and last elements of the LRU
list are stored in the next and
prev fields of the dentry_unused variable of type list_head. The d_lru field of the dentry object contains
pointers to the adjacent dentries in the list.
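The LRU discipline just described can be sketched with a small user-space list; the structure names below are invented, and only the insert-at-head, evict-from-tail policy mirrors the kernel's:

```c
#include <stddef.h>

/* Hypothetical sketch of the unused-dentry LRU: a released object goes
 * to the front of the list, shrinking evicts from the tail, so the most
 * recently released objects survive longest. */
struct lru_node {
    int id;
    struct lru_node *prev, *next;
};

struct lru_list { struct lru_node *head, *tail; };

/* Release: insert at the front of the list. */
static void lru_release(struct lru_list *l, struct lru_node *n)
{
    n->prev = NULL;
    n->next = l->head;
    if (l->head)
        l->head->prev = n;
    l->head = n;
    if (!l->tail)
        l->tail = n;
}

/* Shrink: remove one element from the tail (least recently released). */
static struct lru_node *lru_shrink(struct lru_list *l)
{
    struct lru_node *victim = l->tail;
    if (!victim)
        return NULL;
    l->tail = victim->prev;
    if (l->tail)
        l->tail->next = NULL;
    else
        l->head = NULL;
    return victim;
}

static int lru_demo(void)
{
    struct lru_list l = { NULL, NULL };
    struct lru_node a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
    lru_release(&l, &a);        /* released first: ends up at the tail */
    lru_release(&l, &b);
    lru_release(&l, &c);        /* released last: front of the list */
    return lru_shrink(&l)->id;  /* eviction hits the oldest release */
}
```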
Each "in use" dentry object is inserted into a doubly linked
list specified by the i_dentry
field of the corresponding inode object (because each inode could be
associated with several hard links, a list is required). The d_alias field of the dentry object stores
the addresses of the adjacent elements in the list. Both fields are of
type struct list_head.
An "in use" dentry object may become "negative" when the last hard link to the corresponding file is deleted. In this case, the dentry object is moved into the LRU list of unused dentries. Each time the kernel shrinks the dentry cache, negative dentries move toward the tail of the LRU list so that they are gradually freed (see the section "Reclaiming Pages of Shrinkable Disk Caches" in Chapter 17).
The hash table is implemented by means of a dentry_hashtable array. Each element is a
pointer to a list of dentries that hash to the same hash table value.
The array's size usually depends on the amount of RAM installed in the
system; the default value is 256 entries per megabyte of RAM. The
d_hash field of the dentry object
contains pointers to the adjacent elements in the list associated with
a single hash value. The hash function produces its value from both
the dentry object of the directory and the filename.
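A chained hash table keyed on the (directory, filename) pair, in the style of dentry_hashtable, can be sketched in user-space C; the table size and hash function below are illustrative, not the kernel's:

```c
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 64

/* Hypothetical cache entry: the key combines the parent directory
 * (an opaque pointer here) with the filename, as described above. */
struct hnode {
    const void *parent;
    const char *name;
    struct hnode *next;        /* chain of entries with the same hash */
};

static struct hnode *table[TABLE_SIZE];

/* Illustrative hash mixing the parent pointer and the name. */
static unsigned hash(const void *parent, const char *name)
{
    unsigned h = (unsigned)(unsigned long)parent;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % TABLE_SIZE;
}

static void insert(struct hnode *n)
{
    unsigned h = hash(n->parent, n->name);
    n->next = table[h];
    table[h] = n;
}

/* Lookup by (parent, name); NULL when the entry is not cached. */
static struct hnode *lookup(const void *parent, const char *name)
{
    for (struct hnode *n = table[hash(parent, name)]; n; n = n->next)
        if (n->parent == parent && strcmp(n->name, name) == 0)
            return n;
    return NULL;
}

static int dcache_demo(void)
{
    static int dir;                    /* stands in for a parent dentry */
    static struct hnode e = { &dir, "test", NULL };
    insert(&e);
    return lookup(&dir, "test") == &e && lookup(&dir, "tmp") == NULL;
}
```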
The dcache_lock spin lock
protects the dentry cache data structures against concurrent accesses
in multiprocessor systems. The d_lookup(
) function looks in the hash table for a given parent dentry
object and filename; to avoid race conditions, it makes use of a
seqlock (see the section "Seqlocks" in Chapter 5). The __d_lookup( ) function is similar, but it
assumes that no race condition can happen, so it does not use the
seqlock.
We mentioned in the section "An Overview of the Unix
Filesystem" in Chapter
1 that each process has its own current working directory and
its own root directory. These are only two examples of data that must
be maintained by the kernel to represent the interactions between a
process and a filesystem. A whole data structure of type fs_struct is used for that purpose (see
Table 12-6), and
each process descriptor has an fs
field that points to the process fs_struct structure.
Table 12-6. The fields of the fs_struct structure
| Type | Field | Description |
|---|---|---|
| atomic_t | count | Number of processes sharing this table |
| rwlock_t | lock | Read/write spin lock for the table fields |
| int | umask | Bit mask used when opening the file to set the file permissions |
| struct dentry * | root | Dentry of the root directory |
| struct dentry * | pwd | Dentry of the current working directory |
| struct dentry * | altroot | Dentry of the emulated root directory (always NULL for the 80x86 architecture) |
| struct vfsmount * | rootmnt | Mounted filesystem object of the root directory |
| struct vfsmount * | pwdmnt | Mounted filesystem object of the current working directory |
| struct vfsmount * | altrootmnt | Mounted filesystem object of the emulated root directory (always NULL for the 80x86 architecture) |
A second table, whose address is contained in the files field of the process descriptor,
specifies which files are currently opened by the process. It is a
files_struct structure whose fields
are illustrated in Table
12-7.
Table 12-7. The fields of the files_struct structure
| Type | Field | Description |
|---|---|---|
| atomic_t | count | Number of processes sharing this table |
| spinlock_t | file_lock | Read/write spin lock for the table fields |
| int | max_fds | Current maximum number of file objects |
| int | max_fdset | Current maximum number of file descriptors |
| int | next_fd | Maximum file descriptors ever allocated plus 1 |
| struct file ** | fd | Pointer to array of file object pointers |
| fd_set * | close_on_exec | Pointer to file descriptors to be closed on exec( ) |
| fd_set * | open_fds | Pointer to open file descriptors |
| fd_set | close_on_exec_init | Initial set of file descriptors to be closed on exec( ) |
| fd_set | open_fds_init | Initial set of file descriptors |
| struct file * [] | fd_array | Initial array of file object pointers |
The fd field points to an
array of pointers to file objects. The size of the array is stored in
the max_fds field. Usually,
fd points to the fd_array field of the files_struct structure, which includes 32
file object pointers. If the process opens more than 32 files, the
kernel allocates a new, larger array of file pointers and stores its
address in the fd field; it also
updates the max_fds field.
For every file with an entry in the fd array, the array index is the
file descriptor. Usually, the first element
(index 0) of the array is associated with the standard input of the
process, the second with the standard output, and the third with the
standard error (see Figure
12-3). Unix processes use the file descriptor as the main file
identifier. Notice that, thanks to the dup(
) , dup2( ) , and fcntl( )
system calls, two file descriptors may refer to the
same opened file—that is, two elements of the array could point to the
same file object. Users see this all the time when they use shell
constructs such as 2>&1 to
redirect the standard error to the standard output.
A process cannot use more than NR_OPEN (usually, 1,048,576) file
descriptors. The kernel also enforces a dynamic bound on the maximum
number of file descriptors in the signal->rlim[RLIMIT_NOFILE] structure of
the process descriptor; this value is usually 1,024, but it can be
raised if the process has root privileges.
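The RLIMIT_NOFILE bound can be inspected from user space with getrlimit( ); since the limit is tunable, this sketch only checks that a usable bound is in place:

```c
#include <sys/resource.h>

/* Queries the dynamic per-process bound on open file descriptors
 * described above (RLIMIT_NOFILE). The soft limit is often 1,024,
 * but it is tunable, so we only verify a sane bound exists. */
int nofile_limit_ok(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    /* Enough descriptors for stdin, stdout, stderr, and real work. */
    return rl.rlim_cur >= 8;
}
```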
The open_fds field initially
contains the address of the open_fds_init field, which is a bitmap that
identifies the file descriptors of currently opened files. The
max_fdset field stores the number
of bits in the bitmap. Because the fd_set data structure includes 1,024 bits,
there is usually no need to expand the size of the bitmap. However,
the kernel may dynamically expand the size of the bitmap if this turns
out to be necessary, much as in the case of the array of file
objects.
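The open-file bitmap has a close user-space analog in the POSIX fd_set type, which, like the initial open_fds_init bitmap, holds FD_SETSIZE bits (normally 1,024):

```c
#include <sys/select.h>

/* Mimics the open-file bitmap with the POSIX fd_set macros:
 * set a bit when a descriptor is "opened", clear it on "close". */
int fd_bitmap_demo(void)
{
    fd_set open_fds;
    FD_ZERO(&open_fds);
    FD_SET(0, &open_fds);              /* mark fds 0..2 as open, */
    FD_SET(1, &open_fds);
    FD_SET(2, &open_fds);
    FD_CLR(1, &open_fds);              /* then "close" fd 1 */
    return FD_ISSET(0, &open_fds) && !FD_ISSET(1, &open_fds)
        && FD_ISSET(2, &open_fds) && FD_SETSIZE >= 1024;
}
```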
The kernel provides an fget(
) function to be invoked when the kernel starts using a file
object. This function receives as its parameter a file descriptor
fd. It returns the address in
current->files->fd[fd] (that
is, the address of the corresponding file object), or NULL if no file corresponds to fd. In the first case, fget( ) increases the file object usage
counter f_count by 1.
The kernel also provides an fput(
) function to be invoked when a kernel control path finishes
using a file object. This function receives as its parameter the
address of a file object and decreases its usage counter, f_count. Moreover, if this field becomes 0,
the function invokes the release
method of the file operations (if defined), decreases the i_writecount field in the inode object (if
the file was opened for writing), removes the file object from the
superblock's list, releases the file object to the slab allocator, and
decreases the usage counters of the associated dentry object and of
the filesystem descriptor (see the later section "Filesystem
Mounting").
The fget_light( ) and
fput_light( ) functions are faster
versions of fget( ) and fput( ): the kernel uses them when it can
safely assume that the current process already owns the file
object—that is, the process has already previously increased the file
object's reference counter. For instance, they are used by the service
routines of the system calls that receive a file descriptor as an
argument, because the file object's reference counter has been
increased by a previous open( )
system call.
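The fget( )/fput( ) discipline amounts to reference counting with deferred release. A hypothetical user-space sketch (none of these names are kernel code):

```c
/* Hypothetical miniature of the fget( )/fput( ) pattern: a usage
 * counter incremented on get, decremented on put, with teardown
 * deferred until the last reference is dropped. */
struct mini_file_obj {
    int f_count;
    int released;           /* set when the object is torn down */
};

static void mini_fget(struct mini_file_obj *f)
{
    f->f_count++;
}

static void mini_fput(struct mini_file_obj *f)
{
    if (--f->f_count == 0)
        f->released = 1;    /* stands in for release( ), iput( ), etc. */
}

static int refcount_demo(void)
{
    struct mini_file_obj f = { .f_count = 1, .released = 0 }; /* open */
    mini_fget(&f);          /* a second user of the object */
    mini_fput(&f);          /* first user done: object must survive */
    int alive = !f.released;
    mini_fput(&f);          /* last reference dropped: release runs */
    return alive && f.released && f.f_count == 0;
}
```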
[*] The quota system defines for each
user and/or group limits on the amount of space that can be
used on a given filesystem (see the quotactl( ) system call).
[*] The files_init( )
function, executed during kernel initialization, sets the max_files field to one-tenth of the
available RAM in kilobytes, but the system administrator can tune
this parameter by writing into the
/proc/sys/fs/file-max file. Moreover, the
superuser can always get a file object, even if max_files file objects have already been
allocated.
The Linux kernel supports many different types of filesystems. In the following, we introduce a few special types of filesystems that play an important role in the internal design of the Linux kernel.
Next, we'll discuss filesystem registration—that is, the basic operation that must be performed, usually during system initialization, before using a filesystem type. Once a filesystem is registered, its specific functions are available to the kernel, so that type of filesystem can be mounted on the system's directory tree.
While network and disk-based filesystems enable the user to handle information stored outside the kernel, special filesystems may provide an easy way for system programs and administrators to manipulate the data structures of the kernel and to implement special features of the operating system. Table 12-8 lists the most common special filesystems used in Linux; for each of them, the table reports its suggested mount point and a short description.
Notice that a few filesystems have no fixed mount point (keyword "any" in the table). These filesystems can be freely mounted and used by the users. Moreover, some other special filesystems do not have a mount point at all (keyword "none" in the table). They are not for user interaction, but the kernel can use them to easily reuse some of the VFS layer code; for instance, we'll see in Chapter 19 that, thanks to the pipefs special filesystem, pipes can be treated in the same way as FIFO files.
Table 12-8. Most common special filesystems
| Name | Mount point | Description |
|---|---|---|
| bdev | none | Block devices (see Chapter 13) |
| binfmt_misc | any | Miscellaneous executable formats (see Chapter 20) |
| devpts | /dev/pts | Pseudoterminal support (Open Group's Unix98 standard) |
| eventpollfs | none | Used by the efficient event polling mechanism |
| futexfs | none | Used by the futex (Fast Userspace Locking) mechanism |
| pipefs | none | Pipes (see Chapter 19) |
| proc | /proc | General access point to kernel data structures |
| rootfs | none | Provides an empty root directory for the bootstrap phase |
| shm | none | IPC-shared memory regions (see Chapter 19) |
| mqueue | any | Used to implement POSIX message queues (see Chapter 19) |
| sockfs | none | Sockets |
| sysfs | /sys | General access point to system data (see Chapter 13) |
| tmpfs | any | Temporary files (kept in RAM unless swapped) |
| usbfs | /proc/bus/usb | USB devices |
Special filesystems are not bound to physical block devices.
However, the kernel assigns to each mounted special filesystem a
fictitious block device that has the value 0 as major number and an
arbitrary value (different for each special filesystem) as a minor
number. The set_anon_super( )
function is used to initialize superblocks of special filesystems;
this function essentially gets an unused minor number dev and sets the s_dev field of the new superblock with major
number 0 and minor number dev.
Another function called kill_anon_super(
) removes the superblock of a special filesystem. The
unnamed_dev_idr variable includes
pointers to auxiliary structures that record the minor numbers
currently in use. Although some kernel designers dislike the
fictitious block device identifiers, they help the kernel to handle
special filesystems and regular ones in a uniform way.
We'll see a practical example of how the kernel defines and initializes a special filesystem in the later section "Mounting a Generic Filesystem."
Often, the user configures Linux to recognize all the filesystems needed when compiling the kernel for his system. But the code for a filesystem actually may either be included in the kernel image or dynamically loaded as a module (see Appendix B). The VFS must keep track of all filesystem types whose code is currently included in the kernel. It does this by performing filesystem type registration .
Each registered filesystem is represented as a file_system_type object whose fields are
illustrated in Table
12-9.
Table 12-9. The fields of the file_system_type object
| Type | Field | Description |
|---|---|---|
| const char * | name | Filesystem name |
| int | fs_flags | Filesystem type flags |
| struct super_block *(*)( ) | get_sb | Method for reading a superblock |
| void (*)( ) | kill_sb | Method for removing a superblock |
| struct module * | owner | Pointer to the module implementing the filesystem (see Appendix B) |
| struct file_system_type * | next | Pointer to the next element in the list of filesystem types |
| struct list_head | fs_supers | Head of a list of superblock objects having the same filesystem type |
All filesystem-type objects are inserted into a singly linked
list. The file_systems variable
points to the first item, while the next field of the structure points to the
next item in the list. The file_systems_lock read/write spin lock
protects the whole list against concurrent accesses.
The fs_supers field
represents the head (first dummy element) of a list of superblock
objects corresponding to mounted filesystems of the given type. The
backward and forward links of a list element are stored in the
s_instances field of the superblock
object.
The get_sb field points to
the filesystem-type-dependent function that allocates a new superblock
object and initializes it (if necessary, by reading a disk). The
kill_sb field points to the
function that destroys a superblock.
The fs_flags field stores
several flags, which are listed in Table 12-10.
Table 12-10. The filesystem type flags
| Name | Description |
|---|---|
| FS_REQUIRES_DEV | Every filesystem of this type must be located on a physical disk device. |
| FS_BINARY_MOUNTDATA | The filesystem uses binary mount data. |
| FS_REVAL_DOT | Always revalidate the "." and ".." paths in the dentry cache (for network filesystems). |
| FS_ODD_RENAME | "Rename" operations are "move" operations (for network filesystems). |
During system initialization, the register_filesystem( ) function is invoked
for every filesystem specified at compile time; the function inserts
the corresponding file_system_type
object into the filesystem-type list.
The register_filesystem( )
function is also invoked when a module implementing a filesystem is
loaded. In this case, the filesystem may also be unregistered (by
invoking the unregister_filesystem(
) function) when the module is unloaded.
The get_fs_type( ) function,
which receives a filesystem name as its parameter, scans the list of
registered filesystems looking at the name field of their descriptors, and returns
a pointer to the corresponding file_system_type object, if it is
present.
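The singly linked registration list and the get_fs_type( ) lookup can be modeled in user space. The sketch below is illustrative only: it keeps the kernel's field names (name, next) and the head variable file_systems, but omits locking (file_systems_lock), modules, and every other field; the _model suffixes mark the invented functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space sketch of the filesystem-type list. */
struct file_system_type_model {
    const char *name;
    struct file_system_type_model *next;
};

static struct file_system_type_model *file_systems;   /* list head */

/* register_filesystem( ) analogue: refuse duplicates, append at the tail. */
static int register_filesystem_model(struct file_system_type_model *fs)
{
    struct file_system_type_model **p;

    for (p = &file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* get_fs_type( ) analogue: scan the list looking at the name field. */
static struct file_system_type_model *get_fs_type_model(const char *name)
{
    struct file_system_type_model *fs;

    for (fs = file_systems; fs; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs;
    return NULL;
}

static int fs_type_demo(void)
{
    static struct file_system_type_model ext2 = { "ext2", NULL };
    static struct file_system_type_model proc = { "proc", NULL };

    if (register_filesystem_model(&ext2) || register_filesystem_model(&proc))
        return 0;
    if (register_filesystem_model(&ext2) == 0)  /* duplicate must fail */
        return 0;
    return get_fs_type_model("proc") == &proc &&
           get_fs_type_model("vfat") == NULL;
}
```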
Like every traditional Unix system, Linux makes use of a system's root filesystem : it is the filesystem that is directly mounted by the kernel during the booting phase and that holds the system initialization scripts and the most essential system programs.
Other filesystems can be mounted—either by the initialization scripts or directly by the users—on directories of already mounted filesystems. Being a tree of directories, every filesystem has its own root directory. The directory on which a filesystem is mounted is called the mount point. A mounted filesystem is a child of the mounted filesystem to which the mount point directory belongs. For instance, the /proc virtual filesystem is a child of the system's root filesystem (and the system's root filesystem is the parent of /proc). The root directory of a mounted filesystem hides the content of the mount point directory of the parent filesystem, as well as the whole subtree of the parent filesystem below the mount point.[*]
In a traditional Unix system, there is only one tree of mounted filesystems: starting from the system's root filesystem, each process can potentially access every file in a mounted filesystem by specifying the proper pathname. In this respect, Linux 2.6 is more refined: every process might have its own tree of mounted filesystems—the so-called namespace of the process.
Usually most processes share the same namespace, which is the
tree of mounted filesystems that is rooted at the system's root
filesystem and that is used by the init process.
However, a process gets a new namespace if it is created by the
clone( ) system call with the CLONE_NEWNS flag set (see the section "The clone( ), fork( ), and
vfork( ) System Calls" in Chapter 3). The new namespace is
then inherited by children processes if the parent creates them
without the CLONE_NEWNS
flag.
When a process mounts—or unmounts—a filesystem, it only modifies
its namespace. Therefore, the change is visible to all processes that
share the same namespace, and only to them. A process can even change
the root filesystem of its namespace by using the Linux-specific
pivot_root( ) system call.
The namespace of a process is represented by a namespace structure pointed to by the
namespace field of the process
descriptor. The fields of the namespace structure are shown in Table 12-11.
Table 12-11. The fields of the namespace structure
| Type | Field | Description |
|---|---|---|
| atomic_t | count | Usage counter (how many processes share the namespace) |
| struct vfsmount * | root | Mounted filesystem descriptor for the root directory of the namespace |
| struct list_head | list | Head of list of all mounted filesystem descriptors |
| struct rw_semaphore | sem | Read/write semaphore protecting this structure |
The list field is the head of
a doubly linked circular list collecting all mounted filesystems that
belong to the namespace. The root
field specifies the mounted filesystem that represents the root of the
tree of mounted filesystems of this namespace. As we will see in the
next section, mounted filesystems are represented by vfsmount structures.
In most traditional Unix-like kernels, each filesystem can be mounted only once. Suppose that an Ext2 filesystem stored in the /dev/fd0 floppy disk is mounted on /flp by issuing the command:
mount -t ext2 /dev/fd0 /flp
Until the filesystem is unmounted by issuing a umount command, every other mount command
acting on /dev/fd0 fails.
However, Linux is different: it is possible to mount the same filesystem several times. Of course, if a filesystem is mounted n times, its root directory can be accessed through n mount points, one per mount operation. Although the same filesystem can be accessed by using different mount points, it is really unique. Thus, there is only one superblock object for all of them, no matter how many times the filesystem has been mounted.
Mounted filesystems form a hierarchy: the mount point of a filesystem might be a directory of a second filesystem, which in turn is already mounted over a third filesystem, and so on.[*]
It is also possible to stack multiple mounts on a single mount point. Each new mount on the same mount point hides the previously mounted filesystem, although processes already using the files and directories under the old mount can continue to do so. When the topmost mounting is removed, then the next lower mount is once more made visible.
As you can imagine, keeping track of mounted filesystems can
quickly become a nightmare. For each mount operation, the kernel must
save in memory the mount point and the mount flags, as well as the
relationships between the filesystem to be mounted and the other
mounted filesystems. Such information is stored in a mounted
filesystem descriptor of type vfsmount. The fields of this descriptor are
shown in Table
12-12.
Table 12-12. The fields of the vfsmount data structure
| Type | Field | Description |
|---|---|---|
| struct list_head | mnt_hash | Pointers for the hash table list. |
| struct vfsmount * | mnt_parent | Points to the parent filesystem on which this filesystem is mounted. |
| struct dentry * | mnt_mountpoint | Points to the dentry of the mount point directory where the filesystem is mounted. |
| struct dentry * | mnt_root | Points to the dentry of the root directory of this filesystem. |
| struct super_block * | mnt_sb | Points to the superblock object of this filesystem. |
| struct list_head | mnt_mounts | Head of a list including all filesystem descriptors mounted on directories of this filesystem. |
| struct list_head | mnt_child | Pointers for the list of child mounted filesystem descriptors. |
| atomic_t | mnt_count | Usage counter (increased to forbid filesystem unmounting). |
| int | mnt_flags | Flags. |
| int | mnt_expiry_mark | Flag set to true if the filesystem is marked as expired (the filesystem can be automatically unmounted if the flag is set and no one is using it). |
| char * | mnt_devname | Device filename. |
| struct list_head | mnt_list | Pointers for the namespace's list of mounted filesystem descriptors. |
| struct list_head | mnt_fslink | Pointers for the filesystem-specific expire list. |
| struct namespace * | mnt_namespace | Pointer to the namespace of the process that mounted the filesystem. |
The vfsmount data structures
are kept in several doubly linked circular lists:
A hash table indexed by the address of the vfsmount descriptor of the parent
filesystem and the address of the dentry object of the mount point
directory. The hash table is stored in the mount_hashtable array, whose size
depends on the amount of RAM in the system. Each item of the table
is the head of a circular doubly linked list storing all
descriptors that have the same hash value. The mnt_hash field of the descriptor
contains the pointers to adjacent elements in this list.
For each namespace, a circular doubly linked list including
all mounted filesystem descriptors belonging to the namespace. The
list field of the namespace structure stores the head of
the list, while the mnt_list
field of the vfsmount
descriptor contains the pointers to adjacent elements in the
list.
For each mounted filesystem, a circular doubly linked list
including all child mounted filesystems. The head of each list is
stored in the mnt_mounts field
of the mounted filesystem descriptor; moreover, the mnt_child field of the descriptor stores
the pointers to the adjacent elements in the list.
The vfsmount_lock spin lock
protects the lists of mounted filesystem objects from concurrent
accesses.
The mnt_flags field of the
descriptor stores the value of several flags that specify how some
kinds of files in the mounted filesystem are handled. These flags,
which can be set through options of the mount command, are listed in Table 12-13.
Table 12-13. Mounted filesystem flags
| Name | Description |
|---|---|
| MNT_NOSUID | Forbid setuid and setgid flags in the mounted filesystem |
| MNT_NODEV | Forbid access to device files in the mounted filesystem |
| MNT_NOEXEC | Disallow program execution in the mounted filesystem |
Here are some functions that handle the mounted filesystem descriptors:
alloc_vfsmnt(name)

Allocates and initializes a mounted filesystem descriptor

free_vfsmnt(mnt)

Frees a mounted filesystem descriptor pointed to by mnt

lookup_mnt(mnt, dentry)

Looks up a descriptor in the hash table and returns its address
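The lookup_mnt( )-style hash lookup can be sketched in user space. In the model below, only the indexing scheme follows the text (descriptors are found by hashing the parent-mount and mount-point-dentry pair); the hash function, table size, and all names with the _model suffix are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_SIZE 16   /* illustrative; the real size depends on RAM */

/* Stand-in for the vfsmount descriptor: only the fields used by the
 * hash lookup are modeled; next stands in for the mnt_hash pointers. */
struct vfsmount_model {
    void *mnt_parent;             /* parent mounted filesystem */
    void *mnt_mountpoint;         /* mount-point dentry        */
    struct vfsmount_model *next;
};

static struct vfsmount_model *mount_hashtable[HASH_SIZE];

/* Invented hash over the (parent, dentry) address pair. */
static unsigned hash_mnt(void *mnt, void *dentry)
{
    unsigned long v = (unsigned long)mnt + (unsigned long)dentry;
    return (unsigned)((v / sizeof(long)) % HASH_SIZE);
}

static void add_mount_model(struct vfsmount_model *m)
{
    unsigned h = hash_mnt(m->mnt_parent, m->mnt_mountpoint);
    m->next = mount_hashtable[h];
    mount_hashtable[h] = m;
}

/* lookup_mnt( ) analogue: walk the bucket, compare both keys. */
static struct vfsmount_model *lookup_mnt_model(void *mnt, void *dentry)
{
    struct vfsmount_model *m = mount_hashtable[hash_mnt(mnt, dentry)];

    for (; m; m = m->next)
        if (m->mnt_parent == mnt && m->mnt_mountpoint == dentry)
            return m;
    return NULL;
}

static int lookup_demo(void)
{
    int parent, dentry, other;
    struct vfsmount_model m = { &parent, &dentry, NULL };

    add_mount_model(&m);
    return lookup_mnt_model(&parent, &dentry) == &m &&
           lookup_mnt_model(&parent, &other) == NULL;
}
```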
We'll now describe the actions performed by the kernel in order to mount a filesystem. We'll start by considering a filesystem that is going to be mounted over a directory of an already mounted filesystem (in this discussion we will refer to this new filesystem as "generic").
The mount( ) system call is used to mount a generic filesystem; its
sys_mount( ) service routine acts
on the following parameters:
The pathname of a device file containing the filesystem, or
NULL if it is not required (for
instance, when the filesystem to be mounted is
network-based)
The pathname of the directory on which the filesystem will be mounted (the mount point)
The filesystem type, which must be the name of a registered filesystem
The mount flags (permitted values are listed in Table 12-14)
A pointer to a filesystem-dependent data structure (which
may be NULL)
Table 12-14. Flags used by the mount() system call
| Macro | Description |
|---|---|
| MS_RDONLY | Files can only be read |
| MS_NOSUID | Forbid setuid and setgid flags |
| MS_NODEV | Forbid access to device files |
| MS_NOEXEC | Disallow program execution |
| MS_SYNCHRONOUS | Write operations on files and directories are immediate |
| MS_REMOUNT | Remount the filesystem changing the mount flags |
| MS_MANDLOCK | Mandatory locking allowed |
| MS_DIRSYNC | Write operations on directories are immediate |
| MS_NOATIME | Do not update file access time |
| MS_NODIRATIME | Do not update directory access time |
| MS_BIND | Create a "bind mount," which allows making a file or directory visible at another point of the system directory tree (option --bind of the mount command) |
| MS_MOVE | Atomically move a mounted filesystem to another mount point (option --move of the mount command) |
| MS_REC | Recursively create "bind mounts" for a directory subtree |
| MS_VERBOSE | Generate kernel messages on mount errors |
The sys_mount( ) function
copies the value of the parameters into temporary kernel buffers,
acquires the big kernel lock , and invokes the do_mount(
) function. Once do_mount(
) returns, the service routine releases the big kernel lock
and frees the temporary kernel buffers.
The do_mount( ) function
takes care of the actual mount operation by performing the following
operations:
If some of the MS_NOSUID,
MS_NODEV, or MS_NOEXEC mount flags are set, it clears
them and sets the corresponding flag (MNT_NOSUID, MNT_NODEV, MNT_NOEXEC) in the mounted filesystem
object.
Looks up the pathname of the mount point by invoking
path_lookup( ); this function
stores the result of the pathname lookup in the local variable
nd of type nameidata (see the later section "Pathname
Lookup").
Examines the mount flags to determine what has to be done. In particular:
If the MS_REMOUNT
flag is specified, the purpose is usually to change the mount
flags in the s_flags field
of the superblock object and the mounted filesystem flags in
the mnt_flags field of the
mounted filesystem object. The do_remount( ) function performs
these changes.
Otherwise, it checks the MS_BIND flag. If it is specified,
the user is asking to make visible a file or directory on
another point of the system directory tree.
Otherwise, it checks the MS_MOVE flag. If it is specified,
the user is asking to change the mount point of an already
mounted filesystem. The do_move_mount( ) function does this
atomically.
Otherwise, it invokes do_new_mount( ). This is the most
common case. It is triggered when the user asks to mount
either a special filesystem or a regular filesystem stored in
a disk partition. do_new_mount(
) invokes the do_kern_mount( ) function passing to
it the filesystem type, the mount flags, and the block device
name. This function, which takes care of the actual mount
operation and returns the address of a new mounted filesystem
descriptor, is described below. Next, do_new_mount( ) invokes do_add_mount( ), which essentially
performs the following actions:
Acquires for writing the namespace->sem semaphore of
the current process, because the function is going to
modify the namespace.
The do_kern_mount(
) function might put the current process to
sleep; meanwhile, another process might mount a filesystem
on the very same mount point as ours or even change our
root filesystem (current->namespace->root).
Verifies that the filesystem most recently mounted on this
mount point still belongs to the current process's namespace; if
not, it releases the read/write semaphore and returns an error
code.
If the filesystem to be mounted is already mounted on the mount point specified as parameter of the system call, or if the mount point is a symbolic link, it releases the read/write semaphore and returns an error code.
Initializes the flags in the mnt_flags field of the new
mounted filesystem object allocated by do_kern_mount( ).
Invokes graft_tree(
) to insert the new mounted filesystem object in
the namespace list, in the hash table, and in the children
list of the parent-mounted filesystem.
Releases the namespace->sem read/write
semaphore and returns.
Invokes path_release( )
to terminate the pathname lookup of the mount point (see the later
section "Pathname
Lookup") and returns 0.
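The flag examination at the heart of do_mount( ) can be sketched as a simple dispatch. The flag values and the _M suffix below are invented to keep the model self-contained; only the order of the checks (remount, then bind, then move, then the common new-mount case) follows the text.

```c
#include <assert.h>

/* Illustrative flag values; the real MS_* constants differ. */
#define MS_REMOUNT_M  0x1UL
#define MS_BIND_M     0x2UL
#define MS_MOVE_M     0x4UL

enum mount_action { DO_REMOUNT, DO_BIND, DO_MOVE, DO_NEW_MOUNT };

/* Models the decision sequence of do_mount( ). */
static enum mount_action dispatch_mount(unsigned long flags)
{
    if (flags & MS_REMOUNT_M)
        return DO_REMOUNT;        /* change s_flags / mnt_flags      */
    if (flags & MS_BIND_M)
        return DO_BIND;           /* make a subtree visible elsewhere */
    if (flags & MS_MOVE_M)
        return DO_MOVE;           /* relocate an existing mount      */
    return DO_NEW_MOUNT;          /* the common case: do_new_mount( ) */
}
```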
The core of the mount operation is the do_kern_mount( ) function, which checks
the filesystem type flags to determine how the mount operation is to
be done. This function receives the following parameters:
fstype
The name of the filesystem type to be mounted
flags
The mount flags (see Table 12-14)
name
The pathname of the block device storing the filesystem (or the filesystem type name for special filesystems)
data
Pointer to additional data to be passed to the read_super method of the
filesystem
The function takes care of the actual mount operation by performing essentially the following operations:
Invokes get_fs_type( )
to search in the list of filesystem types and locate the name
stored in the fstype
parameter; get_fs_type( )
returns in the local variable type the address of the corresponding
file_system_type
descriptor.
Invokes alloc_vfsmnt( )
to allocate a new mounted filesystem descriptor and stores its
address in the mnt local
variable.
Invokes the type->get_sb(
) filesystem-dependent function to allocate a new
superblock and to initialize it (see below).
Initializes the mnt->mnt_sb field with the address
of the new superblock object.
Initializes the mnt->mnt_root field with the
address of the dentry object corresponding to the root directory
of the filesystem, and increases the usage counter of the dentry
object.
Initializes the mnt->mnt_parent field with the
value in mnt (for generic
filesystems, the proper value of mnt_parent will be set when the
mounted filesystem descriptor is inserted in the proper lists by
graft_tree( ); see step 3d5
of do_mount( )).
Initializes the mnt->mnt_namespace field with the
value in current->namespace.
Releases the s_umount
read/write semaphore of the superblock object (it was acquired
when the object was allocated in step 3).
Returns the address mnt
of the mounted filesystem object.
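The wiring performed by do_kern_mount( ) can be sketched with user-space stand-ins. In the model below, only the sequence (obtain a superblock via the type's get_sb method, then initialize the descriptor's mnt_sb and mnt_root fields) follows the text; the structures and every name with the _model suffix are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-ins for the superblock and vfsmount structures. */
struct sb_model  { const char *root_name; };
struct mnt_model { struct sb_model *mnt_sb; const char *mnt_root; };

/* type->get_sb( ) analogue: hands back a (static) superblock. */
static struct sb_model *get_sb_model(void)
{
    static struct sb_model sb = { "/" };
    return &sb;
}

/* do_kern_mount( ) analogue for the steps modeled here. */
static int kern_mount_model(struct mnt_model *mnt)
{
    struct sb_model *sb = get_sb_model();   /* allocate/init superblock */
    if (!sb)
        return -1;
    mnt->mnt_sb = sb;                       /* wire mnt->mnt_sb          */
    mnt->mnt_root = sb->root_name;          /* wire mnt->mnt_root        */
    return 0;
}

static int kern_mount_demo(void)
{
    struct mnt_model mnt = { NULL, NULL };

    if (kern_mount_model(&mnt))
        return 0;
    return mnt.mnt_sb != NULL && strcmp(mnt.mnt_root, "/") == 0;
}
```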
The get_sb method
of the filesystem object is usually implemented by a one-line
function. For instance, in the Ext2 filesystem the method is
implemented as follows:
struct super_block * ext2_get_sb(struct file_system_type *type,
        int flags, const char *dev_name, void *data)
{
    return get_sb_bdev(type, flags, dev_name, data, ext2_fill_super);
}
The get_sb_bdev( ) VFS
function allocates and initializes a new superblock suitable for
disk-based filesystems ; it receives the address of the ext2_fill_super( ) function, which reads
the disk superblock from the Ext2 disk partition.
To allocate superblocks suitable for special
filesystems , the VFS also provides the get_sb_pseudo( ) function (for special
filesystems with no mount point such as pipefs
), the get_sb_single(
) function (for special filesystems with single mount
point such as sysfs ), and the get_sb_nodev(
) function (for special filesystems that can be mounted
several times such as tmpfs ; see below).
The most important operations performed by get_sb_bdev( ) are the following:
Invokes open_bdev_excl(
) to open the block device having device file name
dev_name (see the section
"Character Device
Drivers" in Chapter
13).
Invokes sget( ) to
search the list of superblock objects of the filesystem
(type->fs_supers, see the
earlier section "Filesystem Type
Registration"). If a superblock relative to the block
device is already present, the function returns its address.
Otherwise, it allocates and initializes a new superblock object,
inserts it into the filesystem list and in the global list of
superblocks, and returns its address.
If the superblock is not new (it was not allocated in the previous step, because the filesystem is already mounted), it jumps to step 6.
Copies the value of the flags parameter into the s_flags field of the superblock and
sets the s_id, s_old_blocksize, and s_blocksize fields with the proper
values for the block device.
Invokes the filesystem-dependent function passed as last
argument to get_sb_bdev( ) to
access the superblock information on disk and fill the other
fields of the new superblock object.
Returns the address of the new superblock object.
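The sget( ) step above (reuse the superblock if the block device is already mounted, otherwise allocate a fresh one) can be modeled in user space. A fixed-size array replaces the kernel's per-type fs_supers list, the device numbers are illustrative, and the _model/_demo names are invented; only the search-then-allocate logic follows the text.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SUPERS 8   /* illustrative bound; the kernel has no such limit */

struct super_model { int s_dev; int in_use; };

static struct super_model supers[MAX_SUPERS];

/* sget( ) analogue: search existing superblocks, else allocate one. */
static struct super_model *sget_model(int dev)
{
    int i;

    for (i = 0; i < MAX_SUPERS; i++)          /* device already mounted? */
        if (supers[i].in_use && supers[i].s_dev == dev)
            return &supers[i];
    for (i = 0; i < MAX_SUPERS; i++)          /* allocate a new superblock */
        if (!supers[i].in_use) {
            supers[i].in_use = 1;
            supers[i].s_dev = dev;
            return &supers[i];
        }
    return NULL;
}

static int sget_demo(void)
{
    struct super_model *a = sget_model(0x0300);
    struct super_model *b = sget_model(0x0300);   /* same device, second mount */
    struct super_model *c = sget_model(0x0301);   /* different device          */

    return a != NULL && a == b && c != NULL && c != a;
}
```

The a == b check mirrors the point made earlier in the text: no matter how many times a filesystem is mounted, there is only one superblock object for it.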
Mounting the root filesystem is a crucial part of system initialization. It is a fairly complex procedure, because the Linux kernel allows the root filesystem to be stored in many different places, such as a hard disk partition, a floppy disk, a remote filesystem shared via NFS, or even a ramdisk (a fictitious block device kept in RAM).
To keep the description simple, let's assume that the root
filesystem is stored in a partition of a hard disk (the most common
case, after all). While the system boots, the kernel finds the major
number of the disk that contains the root filesystem in the ROOT_DEV variable (see Appendix A). The root filesystem
can be specified as a device file in the /dev directory either when compiling the
kernel or by passing a suitable "root" option to
the initial bootstrap loader. Similarly, the mount flags of the root
filesystem are stored in the root_mountflags variable. The user specifies
these flags either by using the rdev external program on a compiled kernel
image or by passing a suitable rootflags option
to the initial bootstrap loader (see Appendix A).
Mounting the root filesystem is a two-stage procedure, shown in the following list:
Why does the kernel bother to mount the rootfs filesystem before the real one? Well, the rootfs filesystem allows the kernel to easily change the real root filesystem. In fact, in some cases, the kernel mounts and unmounts several root filesystems, one after the other. For instance, the initial bootstrap CD of a distribution might load in RAM a kernel with a minimal set of drivers, which mounts as root a minimal filesystem stored in a ramdisk. Next, the programs in this initial root filesystem probe the hardware of the system (for instance, they determine whether the hard disk is EIDE, SCSI, or whatever), load all needed kernel modules, and remount the root filesystem from a physical block device.
The first stage is performed by the init_rootfs( ) and init_mount_tree( ) functions, which are
executed during system initialization.
The init_rootfs( ) function
registers the special filesystem type
rootfs:
struct file_system_type rootfs_fs_type = {
    .name = "rootfs",
    .get_sb = rootfs_get_sb,
    .kill_sb = kill_litter_super,
};
register_filesystem(&rootfs_fs_type);
The init_mount_tree( )
function executes the following operations:
Invokes do_kern_mount(
) passing to it the string "rootfs" as filesystem type, and stores
the address of the mounted filesystem descriptor returned by
this function in the mnt
local variable. As explained in the previous section, do_kern_mount( ) ends up invoking the
get_sb method of the
rootfs filesystem, that is, the rootfs_get_sb( ) function:
struct super_block *rootfs_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
return get_sb_nodev(fs_type, flags|MS_NOUSER, data,
ramfs_fill_super);
}
The get_sb_nodev( )
function, in turn, executes the following steps:
Invokes sget( ) to
allocate a new superblock passing as parameter the address
of the set_anon_super( )
function (see the earlier section "Special
Filesystems"). As a result, the s_dev field of the superblock is
set in the appropriate way: major number 0, minor number
different from those of other mounted special
filesystems.
Copies the value of the flags parameter into the s_flags field of the
superblock.
Invokes ramfs_fill_super(
) to allocate an inode object and a corresponding
dentry object, and to fill the superblock fields. Because
rootfs is a special filesystem that has
no disk superblock, only a couple of superblock operations
need to be implemented.
Returns the address of the new superblock.
Allocates a namespace
object for the namespace of process 0, and inserts into it the
mounted filesystem descriptor returned by do_kern_mount( ):
namespace = kmalloc(sizeof(*namespace), GFP_KERNEL);
list_add(&mnt->mnt_list, &namespace->list);
namespace->root = mnt;
mnt->mnt_namespace = init_task.namespace = namespace;
Sets the namespace
field of every other process in the system to the address of the
namespace object; also initializes the namespace->count usage counter. (By
default, all processes share the same, initial
namespace.)
Sets the root directory and the current working directory of process 0 to the root filesystem.
The second stage of the mount operation for the root filesystem is performed by the kernel near the end of the system initialization. There are several ways to mount the real root filesystem, according to the options selected when the kernel has been compiled and to the boot options passed by the kernel loader. For the sake of brevity, we consider the case of a disk-based filesystem whose device file name has been passed to the kernel by means of the "root" boot parameter. We also assume that no initial special filesystem is used, except the rootfs filesystem.
The prepare_namespace( )
function executes the following operations:
Sets the root_device_name variable with the
device filename obtained from the "root"
boot parameter. Also, sets the ROOT_DEV variable with the major and
minor numbers of the same device file.
Invokes the mount_root(
) function, which in turn:
Invokes sys_mknod(
) (the service routine of the mknod( ) system call) to create a /dev/root device file in the
rootfs initial root filesystem, having
the major and minor numbers as in ROOT_DEV.
Allocates a buffer and fills it with a list of filesystem type names. This list is either passed to the kernel in the "rootfstype" boot parameter or built by scanning the elements in the singly linked list of filesystem types.
Scans the list of filesystem type names built in the
previous step. For each name, it invokes sys_mount( ) to try to mount the
given filesystem type on the root device. Because each
filesystem-specific method uses a different magic number,
all get_sb( ) invocations
will fail except the one that attempts to fill the
superblock by using the function of the filesystem really
used on the root device. The filesystem is mounted on a
directory named /root
of the rootfs filesystem.
Invokes sys_chdir("/root") to change the
current directory of the process.
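The trial-and-error loop of the previous step can be sketched in user-space C. The fs_type table and the probe function below are illustrative stand-ins for the per-filesystem get_sb( ) methods; only the ext2 and Minix magic numbers are the real ones, the third is made up.

```c
/* Hypothetical probe model: each filesystem type recognizes its own
 * magic number stored in the superblock on the device. */
struct fs_type {
    const char *name;
    unsigned int magic;
};

/* Return the index of the first filesystem type whose magic number
 * matches the one found on the device, or -1 -- mirroring how every
 * mount attempt fails except the one for the filesystem actually
 * present on the root device. */
int probe_root_fs(const struct fs_type *types, int ntypes,
                  unsigned int magic_on_disk)
{
    for (int i = 0; i < ntypes; i++)
        if (types[i].magic == magic_on_disk)
            return i;          /* this mount attempt would succeed */
    return -1;                 /* no filesystem type matched */
}
```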
Moves the mount point of the mounted filesystem on the root directory of the rootfs filesystem:
sys_mount(".", "/", NULL, MS_MOVE, NULL);
sys_chroot(".");
Notice that the rootfs special filesystem is not unmounted: it is only hidden under the disk-based root filesystem.
The umount( )
system call is used to unmount a filesystem. The
corresponding sys_umount( ) service
routine acts on two parameters: a filename (either a mount point
directory or a block device filename) and a set of flags. It performs
the following actions:
Invokes path_lookup( ) to
look up the mount point pathname; this function returns the
results of the lookup operation in a local variable nd of type nameidata (see next section).
If the resulting directory is not the mount point of a
filesystem, it sets the retval
return code to -EINVAL and
jumps to step 6. This check is done by verifying that nd->mnt->mnt_root contains the
address of the dentry object pointed to by nd.dentry.
If the filesystem to be unmounted has not been mounted in
the namespace, it sets the retval return code to -EINVAL and jumps to step 6. (Recall
that some special filesystems have no mount point.) This check is
done by invoking the check_mnt(
) function on nd->mnt.
If the user does not have the privileges required to unmount
the filesystem, it sets the retval return code to -EPERM and jumps to step 6.
Invokes do_umount( )
passing as parameters nd.mnt
(the mounted filesystem object) and flags (the set of flags). This function
performs essentially the following operations:
Retrieves the address of the sb superblock object from the
mnt_sb field of the mounted
filesystem object.
If the user asked to force the unmount operation, it
interrupts any ongoing mount operation by invoking the
umount_begin superblock
operation.
If the filesystem to be unmounted is the root filesystem
and the user didn't ask to actually detach it, it invokes
do_remount_sb( ) to remount
the root filesystem read-only and terminates.
Acquires for writing the namespace->sem read/write
semaphore of the current process, and gets the vfsmount_lock spin lock.
If the mounted filesystem does not include mount points
for any child mounted filesystem, or if the user asked to
forcibly detach the filesystem, it invokes umount_tree( ) to unmount the
filesystem (together with all children filesystems).
Releases the vfsmount_lock spin lock and the
namespace->sem
read/write semaphore of the current process.
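The outcome of the checks in steps 5c and 5e might be modeled like this. All names below are hypothetical simplifications, not the kernel's; the model only captures the three possible outcomes described above.

```c
/* Hypothetical, simplified mount descriptor. */
struct mnt_desc {
    int is_root_fs;   /* this is the namespace's root filesystem */
    int has_children; /* other filesystems are mounted below it  */
};

#define MNT_DETACH_FLAG 1  /* user asked to forcibly detach */

enum umount_outcome {
    DO_UNMOUNT,        /* umount_tree() would be invoked      */
    REMOUNT_READONLY,  /* root filesystem, not being detached */
    ERR_BUSY           /* child mounts present, no detach     */
};

/* Decide what do_umount() would do: remount the root filesystem
 * read-only instead of unmounting it, refuse to unmount a tree that
 * still contains child mounts unless a detach was requested, and
 * unmount otherwise. */
enum umount_outcome umount_decision(const struct mnt_desc *m, int flags)
{
    if (m->is_root_fs && !(flags & MNT_DETACH_FLAG))
        return REMOUNT_READONLY;
    if (m->has_children && !(flags & MNT_DETACH_FLAG))
        return ERR_BUSY;
    return DO_UNMOUNT;
}
```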
Decreases the usage counters of the dentry object
corresponding to the root directory of the filesystem and of the
mounted filesystem descriptor; these counters were increased by
path_lookup( ).
Returns the retval
value.
[*] The root directory of a filesystem can be different from the
root directory of a process: as we have seen in the earlier section
"Files Associated with
a Process," the process's root directory is the directory
corresponding to the "/"
pathname. By default, the process's root directory coincides with the
root directory of the system's root filesystem (or more precisely,
with the root directory of the root filesystem in the
namespace of the process, described in the following section),
but it can be changed by invoking the chroot( ) system call.
[*] Quite surprisingly, the mount point of a filesystem might be a directory of the same filesystem, provided that it was already mounted. For instance:
mount -t ext2 /dev/fd0 /flp; touch /flp/foo
mkdir /flp/mnt; mount -t ext2 /dev/fd0 /flp/mnt
Now, the empty foo file on the floppy filesystem can be accessed both as /flp/foo and /flp/mnt/foo.
When a process must act on a file, it passes its file
pathname to some VFS system call, such as open(
) , mkdir( ), rename( ) , or stat( )
. In this section, we illustrate how the VFS performs a
pathname lookup , that is, how it derives an inode from the corresponding
file pathname.
The standard procedure for performing this task consists of analyzing the pathname and breaking it into a sequence of filenames . All filenames except the last must identify directories.
If the first character of the pathname is /, the pathname is absolute, and the search
starts from the directory identified by current->fs->root (the process root
directory). Otherwise, the pathname is relative, and the search starts
from the directory identified by current->fs->pwd (the process-current
directory).
Having in hand the dentry, and thus the inode, of the initial directory, the code examines the entry matching the first name to derive the corresponding inode. Then the directory file that has that inode is read from disk and the entry matching the second name is examined to derive the corresponding inode. This procedure is repeated for each name included in the path.
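Stripped of everything that makes the real procedure hard (mount points, symbolic links, permission checks, on-disk reads), the iteration can be sketched over an in-memory directory tree. The node structure is a hypothetical user-space model; "." and ".." are handled as described later in this section, with ".." clamped at the root.

```c
#include <string.h>

/* Hypothetical in-memory directory node. */
struct node {
    const char *name;
    struct node *parent;            /* NULL only for the root */
    struct node *children[8];       /* NULL-terminated */
};

static struct node *find_child(struct node *dir, const char *name)
{
    for (int i = 0; dir->children[i]; i++)
        if (strcmp(dir->children[i]->name, name) == 0)
            return dir->children[i];
    return NULL;
}

/* Resolve an absolute pathname component by component, treating "."
 * as a no-op and ".." as a climb that stops at the root. */
struct node *walk(struct node *root, const char *path)
{
    char buf[256], *comp;
    struct node *cur = root;

    strcpy(buf, path);
    for (comp = strtok(buf, "/"); comp; comp = strtok(NULL, "/")) {
        if (strcmp(comp, ".") == 0)
            continue;                   /* no effect in a pathname */
        if (strcmp(comp, "..") == 0) {
            if (cur->parent)            /* climbing stops at "/" */
                cur = cur->parent;
            continue;
        }
        cur = find_child(cur, comp);    /* ordinary component */
        if (!cur)
            return NULL;                /* -ENOENT in the kernel */
    }
    return cur;
}
```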
The dentry cache considerably speeds up the procedure, because it keeps the most recently used dentry objects in memory. As we saw before, each such object associates a filename in a specific directory to its corresponding inode. In many cases, therefore, the analysis of the pathname can avoid reading the intermediate directories from disk.
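A toy model of this caching idea follows; the structures and the fake "disk" are hypothetical, while the real dentry cache is a hash table keyed on the parent dentry and the name, with reclaiming of unused objects.

```c
#include <string.h>

/* Hypothetical (parent inode number, name) -> inode number cache. */
struct dentry_entry {
    int parent_ino;
    char name[32];
    int ino;
};

#define DCACHE_SIZE 16
static struct dentry_entry dcache[DCACHE_SIZE];
static int dcache_used;
static int disk_reads;   /* counts hits on the simulated slow path */

/* Pretend to read a directory from disk (the kernel would execute
 * the filesystem's lookup inode method here). */
static int read_inode_from_disk(int parent_ino, const char *name)
{
    disk_reads++;
    return parent_ino * 100 + (int)strlen(name);  /* fake inode number */
}

/* Look up a name under a directory, filling the cache on a miss. */
int dcache_lookup(int parent_ino, const char *name)
{
    for (int i = 0; i < dcache_used; i++)
        if (dcache[i].parent_ino == parent_ino &&
            strcmp(dcache[i].name, name) == 0)
            return dcache[i].ino;                 /* cache hit */

    int ino = read_inode_from_disk(parent_ino, name);
    if (dcache_used < DCACHE_SIZE) {              /* insert on miss */
        dcache[dcache_used].parent_ino = parent_ino;
        strncpy(dcache[dcache_used].name, name, 31);
        dcache[dcache_used].ino = ino;
        dcache_used++;
    }
    return ino;
}
```

Repeated lookups of the same component hit the cache and never touch the "disk" again, which is exactly the saving the text describes for intermediate directories.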
However, things are not as simple as they look, because the following Unix and VFS filesystem features must be taken into consideration:
The access rights of each directory must be checked to verify whether the process is allowed to read the directory's content.
A filename can be a symbolic link that corresponds to an arbitrary pathname; in this case, the analysis must be extended to all components of that pathname.
Symbolic links may induce circular references; the kernel must take this possibility into account and break endless loops when they occur.
A filename can be the mount point of a mounted filesystem. This situation must be detected, and the lookup operation must continue into the new filesystem.
Pathname lookup has to be done inside the namespace of the process that issued the system call. The same pathname used by two processes with different namespaces may specify different files.
Pathname lookup is performed by the path_lookup( ) function, which receives three
parameters:
name
A pointer to the file pathname to be resolved.
flags
The value of flags that represent how the looked-up file is going to be accessed. The allowed values are included later in Table 12-16.
nd
The address of a nameidata data structure, which stores
the results of the lookup operation and whose fields are shown in
Table
12-15.
When path_lookup( ) returns,
the nameidata structure pointed to by
nd is filled with data pertaining to
the pathname lookup operation.
Table 12-15. The fields of the nameidata data structure
| Type | Field | Description |
|---|---|---|
| struct dentry * | dentry | Address of the dentry object |
| struct vfsmount * | mnt | Address of the mounted filesystem object |
| struct qstr | last | Last component of the pathname (used when the LOOKUP_PARENT flag is set) |
| unsigned int | flags | Lookup flags |
| int | last_type | Type of last component of the pathname (used when the LOOKUP_PARENT flag is set) |
| unsigned int | depth | Current level of symbolic link nesting (see below); it must be smaller than 6 |
| char [ ] * | saved_names | Array of pathnames associated with nested symbolic links |
| union | intent | One-member union specifying how the file will be accessed |
The dentry and mnt fields point respectively to the dentry
object and the mounted filesystem object of the last resolved component
in the pathname. These two fields "describe" the file that is identified
by the given pathname.
Because the dentry object and the mounted filesystem object
returned by the path_lookup( )
function in the nameidata structure
represent the result of a lookup operation, both objects should not be
freed until the caller of path_lookup(
) finishes using them. Therefore, path_lookup( ) increases the usage counters of
both objects. If the caller wants to release these objects, it invokes
the path_release( ) function passing
as parameter the address of a nameidata structure.
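This usage-counter protocol is the classic get/put reference-counting pattern. A generic user-space sketch (a hypothetical object, not the kernel API) shows the contract between path_lookup( ) and path_release( ):

```c
#include <stdlib.h>

/* A reference-counted object: whoever stores a pointer to it takes a
 * reference (get) and drops it when done (put); the object is freed
 * only when the last reference goes away. */
struct refobj {
    int count;
};

struct refobj *refobj_new(void)
{
    struct refobj *o = malloc(sizeof(*o));
    o->count = 1;             /* creator holds the first reference */
    return o;
}

void refobj_get(struct refobj *o)      /* what path_lookup() does */
{
    o->count++;
}

int refobj_put(struct refobj *o)       /* what path_release() does */
{
    if (--o->count == 0) {
        free(o);
        return 1;             /* object was destroyed */
    }
    return 0;                 /* other users still hold it */
}
```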
The flags field stores the
value of some flags used in the lookup operation; they are listed in
Table 12-16. Most of
these flags can be set by the caller in the flags parameter of path_lookup( ).
Table 12-16. The flags of the lookup operation
| Macro | Description |
|---|---|
| LOOKUP_FOLLOW | If the last component is a symbolic link, interpret (follow) it |
| LOOKUP_DIRECTORY | The last component must be a directory |
| LOOKUP_CONTINUE | There are still filenames to be examined in the pathname |
| LOOKUP_PARENT | Look up the directory that includes the last component of the pathname |
| LOOKUP_NOALT | Do not consider the emulated root directory (useless in the 80×86 architecture) |
| LOOKUP_OPEN | Intent is to open a file |
| LOOKUP_CREATE | Intent is to create a file (if it doesn't exist) |
| LOOKUP_ACCESS | Intent is to check user's permission for a file |
The path_lookup( ) function
executes the following steps:
Initializes some fields of the nd parameter as follows:
Sets the last_type
field to LAST_ROOT (this is
needed if the pathname is a slash or a sequence of slashes; see
the later section "Parent Pathname
Lookup").
Sets the flags field to
the value of the flags
parameter.
Sets the depth field to
0.
Acquires for reading the current->fs->lock read/write
semaphore of the current process.
If the first character in the pathname is a slash (/ ), the lookup operation must start from
the root directory of current:
the function gets the addresses of the corresponding mounted
filesystem object (current->fs->rootmnt) and dentry
object (current->fs->root),
increases their usage counters, and stores the addresses in nd->mnt and nd->dentry, respectively.
Otherwise, if the first character in the pathname is not a
slash, the lookup operation must start from the current working
directory of current: the
function gets the addresses of the corresponding mounted filesystem
object (current->fs->pwdmnt) and dentry
object (current->fs->pwd),
increases their usage counters, and stores the addresses in nd->mnt and nd->dentry, respectively.
Releases the current->fs->lock read/write
semaphore of the current process.
Sets the total_link_count
field in the descriptor of the current process to 0 (see the later
section "Lookup of
Symbolic Links").
Invokes the link_path_walk(
) function to take care of the undergoing lookup
operation:
return link_path_walk(name, nd);
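The choice made in steps 3 and 4 boils down to a one-line decision; a hypothetical sketch, with strings standing in for the (vfsmount, dentry) pairs the kernel actually copies into nd:

```c
/* Sketch of steps 3-4 of path_lookup(): an absolute pathname
 * (leading '/') starts the lookup at the process root directory,
 * a relative one at the current working directory. */
const char *lookup_start_dir(const char *pathname, const char *root,
                             const char *pwd)
{
    return (pathname[0] == '/') ? root : pwd;
}
```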
We are now ready to describe the core of the pathname lookup
operation, namely the link_path_walk(
) function. It receives as its parameters a pointer name to the pathname to be resolved and the
address nd of a nameidata data structure.
To make things a bit easier, we first describe what link_path_walk( ) does when LOOKUP_PARENT is not set and the pathname does
not contain symbolic links (standard pathname lookup). Next, we discuss
the case in which LOOKUP_PARENT is
set: this type of lookup is required when creating, deleting, or
renaming a directory entry, that is, during a parent pathname lookup.
Finally, we explain how the function resolves symbolic links.
When the LOOKUP_PARENT flag is cleared, link_path_walk( ) performs the following
steps.
Initializes the lookup_flags local variable with
nd->flags.
Skips all leading slashes (/) before the first component of the pathname.
If the remaining pathname is empty, it returns the value 0.
In the nameidata data
structure, the dentry and
mnt fields point to the objects
relative to the last resolved component of the original
pathname.
If the depth field of the
nd descriptor is positive, it
sets the LOOKUP_FOLLOW flag in
the lookup_flags local variable
(see the section "Lookup of Symbolic
Links").
Executes a cycle that breaks the pathname passed in the
name parameter into components
(the intermediate slashes are treated as filename separators); for
each component found, the function:
Retrieves the address of the inode object of the last
resolved component from nd->dentry->d_inode. (In the
first iteration, the inode refers to the directory from where
to start the pathname lookup.)
Checks that the permissions of the last resolved
component stored into the inode allow execution (in Unix, a
directory can be traversed only if it is executable). If the
inode has a custom permission method, the function
executes it; otherwise, it executes the exec_permission_lite( ) function,
which examines the access mode stored in the i_mode inode field and the
privileges of the running process. In both cases, if the last
resolved component does not allow execution, link_path_walk( ) breaks out of the
cycle and returns an error code.
Considers the next component to be resolved. From its name, the function computes a 32-bit hash value to be used when looking in the dentry cache hash table.
Skips any trailing slash (/) after the slash that terminates the name of the component to be resolved.
If the component to be resolved is the last one in the original pathname, it jumps to step 6.
If the name of the component is "." (a single dot), it continues with the next component ( "." refers to the current directory, so it has no effect inside a pathname).
If the name of the component is ".." (two dots), it tries to climb to the parent directory:
If the last resolved directory is the process's root
directory (nd->dentry is equal to
current->fs->root
and nd->mnt is equal
to current->fs->rootmnt),
then climbing is not allowed: it invokes follow_mount( ) on the last
resolved component (see below) and continues with the next
component.
If the last resolved directory is the root directory
of the nd->mnt
filesystem (nd->dentry is equal to
nd->mnt->mnt_root) and the
nd->mnt filesystem
is not mounted on top of another filesystem (nd->mnt is equal to nd->mnt->mnt_parent), then
the nd->mnt
filesystem is usually[*] the namespace's root filesystem: in this
case, climbing is impossible, thus invokes follow_mount( ) on the last
resolved component (see below) and continues with the next
component.
If the last resolved directory is the root directory
of the nd->mnt
filesystem and the nd->mnt filesystem is mounted
on top of another filesystem, a filesystem switch is
required. So, the function sets nd->dentry to nd->mnt->mnt_mountpoint,
and nd->mnt to
nd->mnt->mnt_parent, then
restarts step 5g (recall that several filesystems can be
mounted on the same mount point).
If the last resolved directory is not the root
directory of a mounted filesystem, then the function must
simply climb to the parent directory: it sets nd->dentry to nd->dentry->d_parent,
invokes follow_mount( )
on the parent directory, and continues with the next
component.
The follow_mount( )
function checks whether nd->dentry is a mount point for
some filesystem (nd->dentry->d_mounted is
greater than zero); in this case, it invokes lookup_mnt( ) to search the root
directory of the mounted filesystem in the dentry
cache , and updates nd->dentry and nd->mnt with the object addresses
corresponding to the mounted filesystem; then, it repeats the
whole operation (there can be several filesystems mounted on
the same mount point). Essentially, invoking the follow_mount( ) function when
climbing to the parent directory is required because the
process could start the pathname lookup from a directory
included in a filesystem hidden by another filesystem mounted
over the parent directory.
The component name is neither "." nor "..", so the
function must look it up in the dentry cache. If the low-level
filesystem has a custom d_hash dentry method, the function
invokes it to modify the hash value already computed in step
5c.
Sets the LOOKUP_CONTINUE flag in nd->flags to denote that there is
a next component to be analyzed.
Invokes do_lookup( )
to derive the dentry object associated with a given parent
directory (nd->dentry)
and filename (the pathname component being resolved). The
function essentially invokes __d_lookup( ) first to search the dentry object of
the component in the dentry cache. If no such object exists,
do_lookup( ) invokes
real_lookup( ). This latter
function reads the directory from disk by executing the
lookup method of the inode,
creates a new dentry object and inserts it in the dentry
cache, then creates a new inode object and inserts it into the
inode cache .[*] At the end of this step, the dentry and mnt fields of the next local variable will point,
respectively, to the dentry object and the mounted filesystem
object of the component name to be resolved in this
cycle.
Invokes the follow_mount(
) function to check whether the component just
resolved (next.dentry)
refers to a directory that is a mount point for some
filesystem (next.dentry->d_mounted is greater
than zero). follow_mount( )
updates next.dentry and
next.mnt so that they point
to the dentry object and mounted filesystem object of the
upmost filesystem mounted on the directory specified by this
pathname component (see step 5g).
Checks whether the component just resolved refers to a
symbolic link (next.dentry->d_inode has a custom
follow_link method). We'll
deal with this case in the later section "Lookup of Symbolic
Links."
Checks whether the component just resolved refers to a
directory (next.dentry->d_inode has a custom
lookup method). If not,
returns the error -ENOTDIR,
because the component is in the middle of the original
pathname.
Sets nd->dentry to
next.dentry and nd->mnt to next.mnt, then continues with the
next component of the pathname.
Now all components of the original pathname are resolved
except the last one. Clears the LOOKUP_CONTINUE flag in nd->flags.
If the pathname has a trailing slash, it sets the LOOKUP_FOLLOW and LOOKUP_DIRECTORY flags in the lookup_flags local variable to force the
last component to be interpreted by later functions as a directory
name.
Checks the value of the LOOKUP_PARENT flag in the lookup_flags variable. In the following,
we assume that the flag is set to 0, and we postpone the opposite
case to the next section.
If the name of the last component is "." (a single dot),
terminates the execution and returns the value 0 (no error). In
the nameidata structure that
nd points to, the dentry and mnt fields refer to the objects relative
to the next-to-last component of the pathname (each component "."
has no effect inside a pathname).
If the name of the last component is ".." (two dots), it tries to climb to the parent directory:
If the last resolved directory is the process's root
directory (nd->dentry is
equal to current->fs->root and nd->mnt is equal to current->fs->rootmnt), it
invokes follow_mount( ) on
the next-to-last component, terminates the execution, and
returns the value 0 (no error). nd->dentry and nd->mnt refer to the objects
relative to the next-to-last component of the pathname—that
is, to the root directory of the process.
If the last resolved directory is the root directory of
the nd->mnt filesystem
(nd->dentry is equal to
nd->mnt->mnt_root)
and the nd->mnt
filesystem is not mounted on top of another filesystem
(nd->mnt is equal to
nd->mnt->mnt_parent),
then climbing is impossible; thus, it invokes follow_mount( ) on the next-to-last
component, terminates the execution, and returns the value 0
(no error).
If the last resolved directory is the root directory of
the nd->mnt filesystem
and the nd->mnt
filesystem is mounted on top of another filesystem, it sets
nd->dentry to nd->mnt->mnt_mountpoint and
nd->mnt to nd->mnt->mnt_parent, then
restarts step 10.
If the last resolved directory is not the root directory
of a mounted filesystem, it sets nd->dentry to nd->dentry->d_parent, invokes
follow_mount( ) on the
parent directory, terminates the execution, and returns the
value 0 (no error). nd->dentry and nd->mnt refer to the objects
relative to the component preceding the next-to-last component
of the pathname.
The name of the last component is neither "." nor "..", so
the function must look it up in the dentry cache. If the low-level
filesystem has a custom d_hash
dentry method, the function invokes it to modify the hash value
already computed in step 5c.
Invokes do_lookup( ) to
derive the dentry object associated with the parent directory and
the filename (see step 5j). At the end of this step, the next local variable contains the
pointers to both the dentry and the mounted filesystem descriptor
relative to the last component name.
Invokes follow_mount( )
to check whether the last component is a mount point for some
filesystem and, if this is the case, to update the next local variable with the addresses
of the dentry object and mounted filesystem object relative to the
root directory of the upmost mounted filesystem.
Checks whether the LOOKUP_FOLLOW flag is set in lookup_flags and the inode object
next.dentry->d_inode has a
custom follow_link method. If
this is the case, the component is a symbolic link that must be
interpreted, as described in the later section "Lookup of Symbolic
Links."
The component is not a symbolic link or the symbolic link
should not be interpreted. Sets the nd->mnt and nd->dentry fields with the value
stored in next.mnt and next.dentry, respectively. The final
dentry object is the result of the whole lookup operation.
Checks whether nd->dentry->d_inode is NULL. This happens when there is no
inode associated with the dentry object, usually because the
pathname refers to a nonexistent file. In this case, the function
returns the error code -ENOENT.
There is an inode associated with the last component of the
pathname. If the LOOKUP_DIRECTORY flag is set in lookup_flags, it checks that the inode
has a custom lookup method—that
is, it is a directory. If not, the function returns the error code
-ENOTDIR.
Returns the value 0 (no error). nd->dentry and nd->mnt refer to the last component
of the pathname.
In many cases, the real target of a lookup operation is not the last component of the pathname, but the next-to-last one. For example, when a file is created, the last component denotes the filename of the not yet existing file, and the rest of the pathname specifies the directory in which the new link must be inserted. Therefore, the lookup operation should fetch the dentry object of the next-to-last component. For another example, unlinking a file identified by the pathname /foo/bar consists of removing bar from the directory foo. Thus, the kernel is really interested in accessing the directory foo rather than bar.
The LOOKUP_PARENT flag is
used whenever the lookup operation must resolve the directory
containing the last component of the pathname, rather than the last
component itself.
When the LOOKUP_PARENT flag
is set, the link_path_walk( )
function also sets up the last and
last_type fields of the nameidata data structure. The last field stores the name of the last
component in the pathname. The last_type field identifies the type of the
last component; it may be set to one of the values shown in Table 12-17.
Table 12-17. The values of the last_type field in the nameidata data structure
| Value | Description |
|---|---|
| LAST_NORM | Last component is a regular filename |
| LAST_ROOT | Last component is "/" (that is, the entire pathname is "/") |
| LAST_DOT | Last component is "." |
| LAST_DOTDOT | Last component is ".." |
| LAST_BIND | Last component is a symbolic link into a special filesystem |
The LAST_ROOT flag is the
default value set by path_lookup( )
when the whole pathname lookup operation starts (see the description
at the beginning of the section "Pathname Lookup"). If the
pathname turns out to be simply "/ ", the kernel does not change the initial
value of the last_type
field.
The remaining values of the last_type field are set by link_path_walk( ) when the LOOKUP_PARENT flag is set; in this case, the
function performs the same steps described in the previous section up
to step 8. From step 8 onward, however, the lookup operation for the
last component of the pathname is different:
Sets nd->last to the
name of the last component.
Initializes nd->last_type to LAST_NORM.
If the name of the last component is "." (a single dot), it
sets nd->last_type to
LAST_DOT.
If the name of the last component is ".." (two dots), it
sets nd->last_type to
LAST_DOTDOT.
Returns the value 0 (no error).
As you can see, the last component is not interpreted at all.
Thus, when the function terminates, the dentry and mnt fields of the nameidata data structure point to the
objects relative to the directory that includes the last
component.
Recall that a symbolic link is a regular file that stores a pathname of another file. A pathname may include symbolic links, and they must be resolved by the kernel.
For example, if /foo/bar is a symbolic link pointing to (containing the pathname) ../dir, the pathname /foo/bar/file must be resolved by the kernel as a reference to the file /dir/file. In this example, the kernel must perform two different lookup operations. The first one resolves /foo/bar: when the kernel discovers that bar is the name of a symbolic link, it must retrieve its content and interpret it as another pathname. The second pathname operation starts from the directory reached by the first operation and continues until the last component of the symbolic link pathname has been resolved. Next, the original lookup operation resumes from the dentry reached in the second one and with the component following the symbolic link in the original pathname.
To further complicate the scenario, the pathname included in a symbolic link may include other symbolic links. You might think that the kernel code that resolves the symbolic links is hard to understand, but this is not true; the code is actually quite simple because it is recursive.
However, untamed recursion is intrinsically dangerous. For
instance, suppose that a symbolic link points to itself. Of course,
resolving a pathname including such a symbolic link may induce an
endless stream of recursive invocations, which in turn quickly leads
to a kernel stack overflow. The link_count field in the descriptor of the
current process is used to avoid the problem: the field is increased
before each recursive execution and decreased right after. If a sixth
nested lookup operation is attempted, the whole lookup operation
terminates with an error code. Therefore, the level of nesting of
symbolic links can be at most 5.
Furthermore, the total_link_count field in the descriptor of
the current process keeps track of how many symbolic links (even
nonnested) were followed in the original lookup operation. If this
counter reaches the value 40, the lookup operation aborts. Without
this counter, a malicious user could create a pathological pathname
including many consecutive symbolic links that freeze the kernel in a
very long lookup operation.
This is how the code basically works: once the link_path_walk( ) function retrieves the
dentry object associated with a component of the pathname, it checks
whether the corresponding inode object has a custom follow_link method (see step 5l and step 14
in the section "Standard
Pathname Lookup"). If so, the inode is a symbolic link that
must be interpreted before proceeding with the lookup operation of the
original pathname.
In this case, the link_path_walk(
) function invokes do_follow_link(
), passing to it the address dentry of the dentry object of the symbolic
link and the address nd of the
nameidata data structure. In turn,
do_follow_link( ) performs the
following steps:
Checks that current->link_count is less than 5;
otherwise, it returns the error code -ELOOP.
Checks that current->total_link_count is less
than 40; otherwise, it returns the error code -ELOOP.
Invokes cond_resched( )
to perform a process switch if required by the current process
(that is, if the TIF_NEED_RESCHED flag in the
thread_info descriptor of the
current process is set).
Increases current->link_count, current->total_link_count, and
nd->depth.
Updates the access time of the inode object associated with the symbolic link to be resolved.
Invokes the filesystem-dependent function that implements
the follow_link method passing
to it the dentry and nd parameters. This function extracts
the pathname stored in the symbolic link's inode, and saves this
pathname in the proper entry of the nd->saved_names array.
Invokes the __vfs_follow_link( ) function, passing to it the address nd and the address of the pathname in the nd->saved_names array (see below).
If defined, executes the put_link method of the inode object,
thus releasing the temporary data structures allocated by the
follow_link method.
Decreases the current->link_count and nd->depth fields.
Returns the error code returned by the __vfs_follow_link( ) function (0 for no error).
In turn, __vfs_follow_link( ) essentially does the following:
Checks whether the first character of the pathname stored in
the symbolic link is a slash: in this case an absolute pathname
has been found, so there is no need to keep in memory any
information about the previous path. If so, invokes path_release( ) on the nameidata structure, thus releasing the
objects resulting from the previous lookup steps; then, the
function sets the dentry and
mnt fields of the nameidata data structure to the current
process root directory.
Invokes link_path_walk( )
to resolve the symbolic link pathname, passing to it as parameters
the pathname and nd.
Returns the value taken from link_path_walk( ).
When do_follow_link( )
finally terminates, it has set the dentry field of the next local variable of the original
execution of link_path_walk( ) to the address of the dentry
object referred to by the symbolic link. The
link_path_walk( ) function can then
proceed with the next step.
[*] This case can also occur for network filesystems disconnected from the namespace's directory tree.
[*] In a few cases, the function might find the required inode already in the inode cache. This happens when the pathname component is the last one and it does not refer to a directory, the corresponding file has several hard links, and finally the file has been recently accessed through a hard link different from the one used in this pathname.
For the sake of brevity, we cannot discuss the implementation of all the VFS system calls listed in Table 12-1. However, it could be useful to sketch out the implementation of a few system calls, in order to show how VFS's data structures interact.
Let's reconsider the example proposed at the beginning of this chapter: a user issues a shell command that copies the MS-DOS file /floppy/TEST to the Ext2 file /tmp/test. The command shell invokes an external program such as cp, which we assume executes the following code fragment:
inf = open("/floppy/TEST", O_RDONLY, 0);
outf = open("/tmp/test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
do {
len = read(inf, buf, 4096);
write(outf, buf, len);
} while (len);
close(outf);
close(inf);
Actually, the code of the real cp program is more complicated, because it must also check for possible error codes returned by each system call. In our example, we focus our attention on the "normal" behavior of a copy operation.
The open( ) system
call is serviced by the sys_open( )
function, which receives as its parameters the pathname filename of the file to be opened, some
access mode flags flags, and a
permission bit mask mode if the
file must be created. If the system call succeeds, it returns a file
descriptor—that is, the index assigned to the new file in the current->files->fd array of pointers
to file objects; otherwise, it returns -1.
In our example, open( ) is
invoked twice; the first time to open /floppy/TEST for reading (O_RDONLY flag) and the second time to open
/tmp/test for writing (O_WRONLY flag). If /tmp/test does not already exist, it is
created (O_CREAT flag) with
exclusive read and write access for the owner (octal 0600 number in the third parameter).
Conversely, if the file already exists, it is rewritten from
scratch (O_TRUNC flag). Table 12-18 lists all
flags of the open( ) system
call.
Table 12-18. The flags of the open( ) system call
| Flag name | Description |
|---|---|
| O_RDONLY | Open for reading |
| O_WRONLY | Open for writing |
| O_RDWR | Open for both reading and writing |
| O_CREAT | Create the file if it does not exist |
| O_EXCL | With O_CREAT, fail if the file already exists |
| O_NOCTTY | Never consider the file as a controlling terminal |
| O_TRUNC | Truncate the file (remove all existing contents) |
| O_APPEND | Always write at end of the file |
| O_NONBLOCK | No system calls will block on the file |
| O_NDELAY | Same as O_NONBLOCK |
| O_SYNC | Synchronous write (block until physical write terminates) |
| FASYNC | I/O event notification via signals |
| O_DIRECT | Direct I/O transfer (no kernel buffering) |
| O_LARGEFILE | Large file (size greater than 2 GB) |
| O_DIRECTORY | Fail if file is not a directory |
| O_NOFOLLOW | Do not follow a trailing symbolic link in pathname |
| O_NOATIME | Do not update the inode's last access time |
Let's describe the operation of the sys_open( ) function. It performs the
following steps:
Invokes getname( ) to
read the file pathname from the process address space.
Invokes get_unused_fd( )
to find an empty slot in current->files->fd. The
corresponding index (the new file descriptor) is stored in the
fd local variable.
Invokes the filp_open( )
function, passing as parameters the pathname, the access mode
flags, and the permission bit mask. This function, in turn,
executes the following steps:
Copies the access mode flags into namei_flags, but encodes the access
mode flags O_RDONLY,
O_WRONLY, and O_RDWR with a special format: the
bit at index 0 (lowest-order) of namei_flags is set only if the file
access requires read privileges; similarly, the bit at index 1
is set only if the file access requires write privileges.
Notice that it is not possible to specify in the open( ) system call that a file
access does not require either read or write privileges; this
makes sense, however, in a pathname lookup operation involving
symbolic links.
Invokes open_namei(
), passing to it the pathname, the modified access
mode flags, and the address of a local nameidata data structure. The
function performs the lookup operation in the following
manner:
If O_CREAT is not
set in the access mode flags, starts the lookup operation
with the LOOKUP_PARENT
flag not set and the LOOKUP_OPEN flag set. Moreover,
the LOOKUP_FOLLOW flag
is set only if O_NOFOLLOW is cleared, while the
LOOKUP_DIRECTORY flag
is set only if the O_DIRECTORY flag is set.
If O_CREAT is set
in the access mode flags, starts the lookup operation with
the LOOKUP_PARENT,
LOOKUP_OPEN, and
LOOKUP_CREATE flags
set. Once the path_lookup(
) function successfully returns, checks whether
the requested file already exists. If not, allocates a new
disk inode by invoking the create method of the parent
inode.
The open_namei( )
function also executes several security checks on the file
located by the lookup operation. For instance, the function
checks whether the inode associated with the dentry object
found really exists, whether it is a regular file, and whether
the current process is allowed to access it according to the
access mode flags. Also, if the file is opened for writing,
the function checks that the file is not locked by other
processes.
Invokes the dentry_open(
) function, passing to it the addresses of the
dentry object and the mounted filesystem object located by the
lookup operation, and the access mode flags. In turn, this
function:
Allocates a new file object.
Initializes the f_flags and f_mode fields of the file object
according to the access mode flags passed to the open( ) system call.
Initializes the f_dentry and f_vfsmnt fields of the file
object according to the addresses of the dentry object and
the mounted filesystem object passed as parameters.
Sets the f_op
field to the contents of the i_fop field of the corresponding
inode object. This sets up all the methods for future file
operations.
Inserts the file object into the list of opened
files pointed to by the s_files field of the
filesystem's superblock.
If the open
method of the file operations is defined, the function
invokes it.
Invokes file_ra_state_init(
) to initialize the read-ahead data structures
(see Chapter
16).
If the O_DIRECT
flag is set, it checks whether direct I/O operations can
be performed on the file (see Chapter 16).
Returns the address of the file object.
Returns the address of the file object.
Sets current->files->fd[fd] to the
address of the file object returned by dentry_open( ).
Returns fd.
Let's return to the code in our cp
example. The open( ) system calls
return two file descriptors, which are stored in the inf and outf variables. Then the program starts a
loop: at each iteration, a portion of the /floppy/TEST file is copied into a local
buffer (read( ) system call), and
then the data in the local buffer is written into the /tmp/test file (write( ) system call).
The read( ) and write( ) system calls are quite similar.
Both require three parameters: a file descriptor fd, the address buf of a memory area (the buffer containing
the data to be transferred), and a number count that specifies how many bytes should
be transferred. Of course, read( )
transfers the data from the file into the buffer, while write( ) does the opposite. Both system
calls return either the number of bytes that were successfully
transferred or -1 to signal an error condition.
A return value less than count does not mean that an error occurred.
The kernel is always allowed to terminate the system call even if not
all requested bytes were transferred, and the user application must
accordingly check the return value and reissue, if necessary, the
system call. Typically, a small value is returned when reading from a
pipe or a terminal device, when reading past the end of the file, or
when the system call is interrupted by a signal. The end-of-file
condition (EOF) can easily be recognized by a zero return value from
read( ). This condition will not be
confused with an abnormal termination due to a signal, because if
read( ) is interrupted by a signal
before any data is read, an error occurs.
The read or write operation always takes place at the file
offset specified by the current file pointer (field f_pos of the file object). Both system calls
update the file pointer by adding the number of transferred bytes to
it.
In short, both sys_read( )
(the read( )'s service routine) and
sys_write( ) (the write( )'s service routine) perform almost
the same steps:
Invokes fget_light( ) to
derive from fd the address
file of the corresponding file
object (see the earlier section "Files Associated with a
Process").
If the flags in file->f_mode do not allow the
requested access (read or write operation), it returns the error
code -EBADF.
If the file object does
not have a read( ) or aio_read( ) (write( ) or aio_write( )) file operation, it returns
the error code -EINVAL.
Invokes access_ok() to
perform a coarse check on the buf and count parameters (see the section "Verifying the
Parameters" in Chapter
10).
Invokes rw_verify_area( )
to check whether there are conflicting mandatory locks for the
file portion to be accessed. If so, it returns an error code, or
puts the current process to sleep if the lock has been requested
with a F_SETLKW command (see
the section "File
Locking" later in this chapter).
If defined, it invokes either the file->f_op->read or file->f_op->write method to
transfer the data; otherwise, invokes either the file->f_op->aio_read or file->f_op->aio_write method. All
these methods, which are discussed in Chapter 16, return the number
of bytes that were actually transferred. As a side effect, the
file pointer is properly updated.
Invokes fput_light( ) to
release the file object.
Returns the number of bytes actually transferred.
The loop in our example code terminates when the
read( ) system call returns the
value 0—that is, when all bytes of /floppy/TEST have been copied into
/tmp/test. The program can then
close the open files, because the copy operation has completed.
The close( ) system call
receives as its parameter fd, which
is the file descriptor of the file to be closed. The sys_close( ) service routine performs the
following operations:
Gets the file object address stored in current->files->fd[fd]; if it is
NULL, returns an error
code.
Sets current->files->fd[fd] to NULL. Releases the file descriptor
fd by clearing the
corresponding bits in the open_fds and close_on_exec fields of current->files (see Chapter 20 for the Close on
Execution flag).
Invokes filp_close( ),
which performs the following operations:
Invokes the flush
method of the file operations, if defined.
Releases all mandatory locks on the file, if any (see next section).
Invokes fput( ) to
release the file object.
Returns 0 or an error code. An error code can be raised by
the flush method or by an error
in a previous write operation on the file.
When a file can be accessed by more than one process, a synchronization problem occurs. What happens if two processes try to write in the same file location? Or again, what happens if a process reads from a file location while another process is writing into it?
In traditional Unix systems, concurrent accesses to the same file location produce unpredictable results. However, Unix systems provide a mechanism that allows the processes to lock a file region so that concurrent accesses may be easily avoided.
The POSIX standard requires a file-locking mechanism based on the
fcntl( ) system call. It is possible to lock an arbitrary region
of a file (even a single byte) or to lock the whole file (including data
appended in the future). Because a process can choose to lock only a
part of a file, it can also hold multiple locks on different parts of
the file.
This kind of lock does not keep out another process that is ignorant of locking. Like a semaphore used to protect a critical region in code, the lock is considered "advisory" because it doesn't work unless other processes cooperate in checking the existence of a lock before accessing the file. Therefore, POSIX's locks are known as advisory locks .
Traditional BSD variants implement advisory locking through the flock( ) system call. This call does not allow a process to lock a
file region, only the whole file. Traditional System V variants provide the lockf(
) library function, which is simply an interface to
fcntl( ).
More importantly, System V Release 3 introduced
mandatory locking: the kernel checks that every
invocation of the open( ) , read( ) , and write( )
system calls does not violate a mandatory lock on the
file being accessed. Therefore, mandatory locks are enforced even between noncooperative
processes.[*]
Whether processes use advisory or mandatory locks, they can use both shared read locks and exclusive write locks . Several processes may have read locks on some file region, but only one process can have a write lock on it at the same time. Moreover, it is not possible to get a write lock when another process owns a read lock for the same file region, and vice versa.
Linux supports all types of file locking: advisory and
mandatory locks, plus the fcntl( )
and flock( ) system calls (lockf( ) is implemented as a standard
library function).
The expected behavior of the flock(
) system call in every Unix-like operating system is to
produce advisory locks only, without regard for the MS_MANDLOCK mount flag. In Linux, however, a
special kind of flock( )'s
mandatory lock is used to support some proprietary network
filesystems . It is the so-called share-mode mandatory
lock; when set, no other process may open a file that would
conflict with the access mode of the lock. Use of this feature for
native Unix applications is discouraged, because the resulting source
code will be nonportable.
Another kind of fcntl(
)-based mandatory lock called lease has
been introduced in Linux. When a process tries to open a file
protected by a lease, it is blocked as usual. However, the process
that owns the lock receives a signal. Once informed, it should first
update the file so that its content is consistent, and then release
the lock. If the owner does not do this in a well-defined time
interval (tunable by writing a number of seconds into /proc /sys/fs/lease-break-time, usually 45
seconds), the lease is automatically removed by the kernel and the
blocked process is allowed to continue.
A process can get or release an advisory file lock on a file in two possible ways:
By issuing the flock( )
system call. The two parameters of the system call are the
fd file descriptor, and a
command to specify the lock operation. The lock applies to the
whole file.
By using the fcntl( )
system call. The three parameters of the system call are the
fd file descriptor, a command
to specify the lock operation, and a pointer to a flock structure (see Table 12-20). A
couple of fields in this structure allow the process to specify
the portion of the file to be locked. Processes can thus hold
several locks on different portions of the same file.
Both the fcntl( ) and the
flock( ) system call may be used on
the same file at the same time, but a file locked through fcntl( ) does not appear locked to flock( ), and vice versa. This has been done
on purpose in order to avoid the deadlocks occurring when an
application using a type of lock relies on a library that uses the
other type.
Handling mandatory file locks is a bit more complex. Here are the steps to follow:
Mount the filesystem where mandatory locking is required
using the -o mand option in the
mount command, which sets the
MS_MANDLOCK flag in the
mount( ) system call. The default is to disable mandatory
locking.
Mark the files as candidates for mandatory locking by setting their set-group bit (SGID) and clearing the group-execute permission bit. Because the set-group bit makes no sense when the group-execute bit is off, the kernel interprets that combination as a hint to use mandatory locks instead of advisory ones.
Use the fcntl( ) system call (see below) to get or release a file lock.
Handling leases is much simpler than handling mandatory locks:
it is sufficient to invoke an fcntl(
) system call with an F_SETLEASE or F_GETLEASE command. Another fcntl( ) invocation with the F_SETSIG command may be used to change the
type of signal to be sent to the lease holder process.
Besides the checks in the read(
) and write( )
system calls, the kernel takes into consideration the
existence of mandatory locks when servicing all system calls that
could modify the contents of a file. For instance, an open( ) system call with the O_TRUNC flag set fails if any mandatory lock
exists for the file.
The following section describes the main data structure used by
the kernel to handle file locks issued by means of the flock( ) system call (FL_FLOCK locks) and of the fcntl( ) system call (FL_POSIX locks).
All types of Linux locks are represented by the same
file_lock data structure, whose
fields are shown in Table
12-19.
Table 12-19. The fields of the file_lock data structure
Type | Field | Description |
|---|---|---|
| struct file_lock * | fl_next | Next element in list of locks associated with the inode |
| struct list_head | fl_link | Pointers for active or blocked list |
| struct list_head | fl_block | Pointers for the lock's waiters list |
| fl_owner_t | fl_owner | Owner's files_struct |
| unsigned int | fl_pid | PID of the process owner |
| wait_queue_head_t | fl_wait | Wait queue of blocked processes |
| struct file * | fl_file | Pointer to file object |
| unsigned char | fl_flags | Lock flags |
| unsigned char | fl_type | Lock type |
| loff_t | fl_start | Starting offset of locked region |
| loff_t | fl_end | Ending offset of locked region |
| struct fasync_struct * | fl_fasync | Used for lease break notifications |
| unsigned long | fl_break_time | Remaining time before end of lease |
| struct file_lock_operations * | fl_ops | Pointer to file lock operations |
| struct lock_manager_operations * | fl_lmops | Pointer to lock manager operations |
| union | fl_u | Filesystem-specific information |
All file_lock structures that
refer to the same file on disk are collected in a singly linked list,
whose first element is pointed to by the i_flock field of the inode object. The
fl_next field of the file_lock structure specifies the next
element in the list.
When a process issues a blocking system call to require an
exclusive lock while there are shared locks on the same file, the lock
request cannot be satisfied immediately and the process must be
suspended. The process is thus inserted into a wait queue pointed to
by the fl_wait field of the blocked
lock's file_lock structure. Two
lists are used to distinguish lock requests that have been satisfied
(active locks ) from those that cannot be satisfied right away
(blocked locks ).
All active locks are linked together in the "global file lock
list" whose head element is stored in the file_lock_list variable. Similarly, all
blocked locks are linked together in the "blocked list" whose head
element is stored in the blocked_list variable. The fl_link field is used to insert a file_lock structure in either one of these
two lists.
Last but not least, the kernel must keep track of all blocked
locks (the "waiters") associated with a given active lock (the
"blocker"): this is the purpose of a list that links together all
waiters with respect to a given blocker. The fl_block field of the blocker is the dummy
head of the list, while the fl_block fields of the waiters store the
pointers to the adjacent elements in the list.
An FL_FLOCK lock is
always associated with a file object and is thus owned by the process
that opened the file (or by all clone processes sharing the same
opened file). When a lock is requested and granted, the kernel
replaces every other lock that the process is holding on the same file
object with the new lock. This happens only when a process wants to
change an already owned read lock into a write one, or vice versa.
Moreover, when a file object is being freed by the fput( ) function, all FL_FLOCK locks that refer to the file object
are destroyed. However, there could be other FL_FLOCK read locks set by other processes
for the same file (inode), and they still remain active.
The flock( ) system call
allows a process to apply or remove an advisory lock on an open file.
It acts on two parameters: the fd
file descriptor of the file to be acted upon and a cmd parameter that specifies the lock
operation. A cmd parameter of
LOCK_SH requires a shared lock for
reading, LOCK_EX requires an
exclusive lock for writing, and LOCK_UN releases the lock.[*]
Usually this system call blocks the current process if the
request cannot be immediately satisfied, for instance if the process
requires an exclusive lock while some other process has already
acquired the same lock. However, if the LOCK_NB flag is passed together with the
LOCK_SH or LOCK_EX operation, the system call does not
block; in other words, if the lock cannot be immediately obtained, the
system call returns an error code.
When the sys_flock( ) service
routine is invoked, it performs the following steps:
Checks whether fd is a
valid file descriptor; if not, returns an error code. Gets the
address filp of the
corresponding file object.
Checks that the process has read and/or write permission on the open file; if not, returns an error code.
Gets a new file_lock
object lock and initializes it
in the appropriate way: the fl_type field is set according to the
value of the parameter cmd, the
fl_file field is set to the
address filp of the file
object, the fl_flags field is
set to FL_FLOCK, the fl_pid field is set to current->tgid, and the fl_end field is set to -1 to denote the fact that locking
refers to the whole file (and not to a portion of it).
If the cmd parameter does
not include the LOCK_NB bit, it
adds to the fl_flags field the
FL_SLEEP flag.
If the file has a flock
file operation, the routine invokes it, passing as its parameters
the file object pointer filp, a
flag (F_SETLKW or F_SETLK depending on the value of the
LOCK_NB bit), and the address
of the new file_lock object
lock.
Otherwise, if the flock
file operation is not defined (the common case), invokes flock_lock_file_wait( ) to try to
perform the required lock operation. Two parameters are passed:
filp, a file object pointer,
and lock, the address of the
new file_lock object created in
step 3.
If the file_lock
descriptor has not been inserted in the active or blocked lists in
the previous step, the routine releases it.
Returns 0 in case of success.
The flock_lock_file_wait( )
function executes a cycle consisting of the following steps:
Invokes flock_lock_file(
) passing as parameters the file object pointer filp and the address of the new file_lock object lock. This function performs, in turn,
the following operations:
Searches the list that filp->f_dentry->d_inode->i_flock
points to. If an FL_FLOCK
lock for the same file object is found, checks its type
(LOCK_SH or LOCK_EX): if it is equal to the type
of the new lock, returns 0 (nothing has to be done).
Otherwise, the function removes the old element from the list
of locks on the inode and the global file lock list, wakes up
all processes sleeping in the wait queues of the locks in the
fl_block list, and frees
the file_lock
structure.
If the process is performing an unlock (LOCK_UN), nothing else needs to be
done: the lock was nonexisting or it has already been
released, thus returns 0.
If an FL_FLOCK lock
for the same file object has been found—thus the process is
changing an already owned read lock into a write one (or vice
versa)—gives some other higher-priority process, in particular
every process previously blocked on the old file lock, a
chance to run by invoking cond_resched( ).
Searches the list of locks on the inode again to verify
that no existing FL_FLOCK
lock conflicts with the requested one. There must be no
FL_FLOCK write lock in the
list, and moreover, there must be no FL_FLOCK lock at all if the process
is requesting a write lock.
If no conflicting lock exists, it inserts the new
file_lock structure into
the inode's lock list and into the global file lock list, then
returns 0 (success).
A conflicting lock has been found: if the FL_SLEEP flag in the fl_flags field is set, it inserts
the new lock (the waiter lock) in the circular list of the
blocker lock and in the global blocked list.
Returns the error code -EAGAIN.
Checks the return code of flock_lock_file( ):
If the return code is 0 (no conflicting locks), it returns 0 (success).
There are incompatibilities. If the FL_SLEEP flag in the fl_flags field is cleared, it
releases the lock file_lock
descriptor and returns -EAGAIN.
Otherwise, there are incompatibilities but the process
can sleep: invokes wait_event_interruptible( ) to
insert the current process in the lock->fl_wait wait queue and to
suspend it. When the process is awakened (right after the
blocker lock has been released), it jumps to step 1 to retry
the operation.
An FL_POSIX lock is
always associated with a process and with an
inode; the lock is automatically released either when the process dies
or when a file descriptor is closed (even if the process opened the
same file twice or duplicated a file descriptor). Moreover, FL_POSIX locks are never inherited by a
child across a fork( ).
When used to lock files, the fcntl(
) system call acts on three parameters: the fd file descriptor of the file to be acted
upon, a cmd parameter that
specifies the lock operation, and an fl pointer to a flock data structure[*] stored in the User Mode process address space; its
fields are described in Table 12-20.
Table 12-20. The fields of the flock data structure
Type | Field | Description |
|---|---|---|
| short | l_type | F_RDLCK, F_WRLCK, or F_UNLCK |
| short | l_whence | How l_start is interpreted: SEEK_SET, SEEK_CUR, or SEEK_END |
| off_t | l_start | Initial offset of the locked region relative to the value of l_whence |
| off_t | l_len | Length of locked region (0 means that the region includes all potential writes past the current end of the file) |
| pid_t | l_pid | PID of the owner |
The sys_fcntl( ) service
routine behaves differently, depending on the value of the flag set in
the cmd parameter:
F_GETLK
Determines whether the lock described by the flock structure conflicts with some
FL_POSIX lock already
obtained by another process. In this case, the flock structure is overwritten with
the information about the existing lock.
F_SETLK
Sets the lock described by the flock structure. If the lock cannot be
acquired, the system call returns an error code.
F_SETLKW
Sets the lock described by the flock structure. If the lock cannot be
acquired, the system call blocks; that is, the calling process
is put to sleep until the lock is available.
F_GETLK64, F_SETLK64, F_SETLKW64
Identical to the previous ones, but the flock64 data structure is used rather
than flock.
The sys_fcntl( ) service
routine first gets the file object corresponding to the fd parameter, and then invokes fcntl_getlk( ) or fcntl_setlk( ), depending on the command
passed as its parameter (F_GETLK
for the former function, F_SETLK or
F_SETLKW for the latter one). We'll
consider the second case only.
The fcntl_setlk( ) function
acts on three parameters: a filp
pointer to the file object, a cmd
command (F_SETLK or F_SETLKW), and a pointer to a flock data structure. The steps performed
are the following:
Reads the structure pointed to by the fl parameter into a local variable of type
flock.
Checks whether the lock should be a mandatory one and the
file has a shared memory mapping (see the section "Memory Mapping" in
Chapter 16). In this
case, the function refuses to create the lock and returns the
-EAGAIN error code, because the
file is already being accessed by another process.
Initializes a new file_lock structure according to the
contents of the user's flock
structure and to the file size stored in the file's inode.
If the command is F_SETLKW, the function sets the FL_SLEEP flag in the fl_flags field of the file_lock structure.
If the l_type field in
the flock structure is equal to
F_RDLCK, it checks whether the
process is allowed to read from the file; similarly, if l_type is equal to F_WRLCK, checks whether the process is
allowed to write into the file. If not, it returns an error
code.
Invokes the lock method
of the file operations, if defined. Usually for disk-based
filesystems , this method is not defined.
Invokes _ _posix_lock_file(
) passing as parameters the address of the file's inode
object and the address of the file_lock object. This function
performs, in turn, the following operations:
Invokes posix_locks_conflict(
) for each FL_POSIX lock in the inode's lock
list. The function checks whether the lock conflicts with the
requested one. Essentially, there must be no FL_POSIX write lock for the same
region in the inode list, and there may be no FL_POSIX lock at all for the same
region if the process is requesting a write lock. However,
locks owned by the same process never conflict; this allows a
process to change the characteristics of a lock it already
owns.
If a conflicting lock is found, the function checks
whether fcntl( ) was
invoked with the F_SETLKW
command. If so, the current process must be suspended: invokes
posix_locks_deadlock( ) to
check that no deadlock condition is being created among
processes waiting for FL_POSIX locks, then inserts the new
lock (waiter lock) both in the blocker list of the conflicting
lock (blocker lock) and in the blocked list, and finally
returns an error code. Otherwise, if fcntl( ) was invoked with the
F_SETLK command, returns an
error code.
As soon as the inode's lock list includes no conflicting
lock, the function checks all the FL_POSIX locks of the current
process that overlap the file region that the current process
wants to lock, and combines and splits adjacent areas as
required. For example, if the process requested a write lock
for a file region that falls inside a read-locked wider
region, the previous read lock is split into two parts
covering the nonoverlapping areas, while the central region is
protected by the new write lock. In case of overlaps, newer
locks always replace older ones.
Inserts the new file_lock structure in the global
file lock list and in the inode list.
Returns the value 0 (success).
Checks the return code of _
_posix_lock_file( ):
If the return code is 0 (no conflicting locks), it returns 0 (success).
There are incompatibilities. If the FL_SLEEP flag in the fl_flags field is cleared, it
releases the new file_lock
descriptor and returns -EAGAIN.
Otherwise, if there are incompatibilities but the
process can sleep, it invokes wait_event_interruptible( ) to
insert the current process in the lock->fl_wait wait queue and to
suspend it. When the process is awakened (right after the
blocker lock has been released), it jumps to step 7 to retry
the operation.
[*] Oddly enough, a process may still unlink (delete) a file even if some other process owns a mandatory lock on it! This perplexing situation is possible because when a process deletes a file hard link, it does not modify its contents, but only the contents of its parent directory.
[*] Actually, the flock( )
system call can also establish share-mode mandatory locks by
specifying the command LOCK_MAND. However, we'll not further
discuss this case.
[*] Linux also defines a flock64 structure, which uses 64-bit
long integers for the offset
and length fields. In the
following, we focus on the flock data structure, but the
description is valid for flock64 too.
The Virtual File System in the last chapter depends on lower-level functions to carry out each read, write, or other operation in a manner suited to each device. The previous chapter included a brief discussion of how operations are handled by different filesystems. In this chapter, we look at how the kernel invokes the operations on actual devices.
In the section "I/O Architecture," we give a brief survey of the 80 × 86 I/O architecture. In the section "The Device Driver Model," we introduce the Linux device driver model. Next, in the section "Device Files," we show how the VFS associates a special file called "device file" with each different hardware device, so that application programs can use all kinds of devices in the same way. We then introduce in the section "Device Drivers" some common characteristics of device drivers. Finally, in the section "Character Device Drivers," we illustrate the overall organization of character device drivers in Linux. We'll defer the discussion of block device drivers to the next chapters.
Readers interested in developing device drivers on their own may want to refer to Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman's Linux Device Drivers, Third Edition (O'Reilly).
To make a computer work properly, data paths must be provided that let information flow between CPU(s), RAM, and the score of I/O devices that can be connected to a personal computer. These data paths, which are denoted as the buses , act as the primary communication channels inside the computer.
Any computer has a system bus that connects most of the internal hardware devices. A typical system bus is the PCI (Peripheral Component Interconnect) bus. Several other types of buses, such as ISA, EISA, MCA, SCSI, and USB, are currently in use. Typically, the same computer includes several buses of different types, linked together by hardware devices called bridges . Two high-speed buses are dedicated to the data transfers to and from the memory chips: the frontside bus connects the CPUs to the RAM controller, while the backside bus connects the CPUs directly to the external hardware cache. The host bridge links together the system bus and the frontside bus.
Any I/O device is hosted by one, and only one, bus. The bus type affects the internal design of the I/O device, as well as how the device has to be handled by the kernel. In this section, we discuss the functional characteristics common to all PC architectures, without giving details about a specific bus type.
The data path that connects a CPU to an I/O device is generically called an I/O bus. The 80 × 86 microprocessors use 16 of their address pins to address I/O devices and 8, 16, or 32 of their data pins to transfer data. The I/O bus, in turn, is connected to each I/O device by means of a hierarchy of hardware components including up to three elements: I/O ports , interfaces, and device controllers. Figure 13-1 shows the components of the I/O architecture.
Each device connected to the I/O bus has its own set of
I/O addresses, which are usually called I/O
ports. In the IBM PC architecture, the I/O
address space provides up to 65,536 8-bit I/O ports. Two
consecutive 8-bit ports may be regarded as a single 16-bit port, which
must start on an even address. Similarly, two consecutive 16-bit ports
may be regarded as a single 32-bit port, which must start on an
address that is a multiple of 4. Four special assembly language
instructions called in, ins , out , and outs
allow the CPU to read from and write into an I/O port.
While executing one of these instructions, the CPU selects the
required I/O port and transfers the data between a CPU register and
the port.
I/O ports may also be mapped into addresses of the physical
address space. The processor is then able to communicate with an I/O
device by issuing assembly language instructions that operate directly
on memory (for instance, mov,
and, or, and so on). Modern hardware devices are
more suited to mapped I/O, because it is faster and can be combined
with DMA.
An important objective for system designers is to offer a unified approach to I/O programming without sacrificing performance. Toward that end, the I/O ports of each device are structured into a set of specialized registers, as shown in Figure 13-2. The CPU writes the commands to be sent to the device into the device control register and reads a value that represents the internal state of the device from the device status register. The CPU also fetches data from the device by reading bytes from the device input register and pushes data to the device by writing bytes into the device output register.
To lower costs, the same I/O port is often used for different purposes. For instance, some bits describe the device state, while others specify the command to be issued to the device. Similarly, the same I/O port may be used as an input register or an output register.
The in, out, ins, and outs assembly language instructions access
I/O ports. The following auxiliary functions are included in the
kernel to simplify such accesses:
inb( ), inw( ), inl( )
Read 1, 2, or 4 consecutive bytes, respectively, from an I/O port. The suffix "b," "w," or "l" refers, respectively, to a byte (8 bits), a word (16 bits), and a long (32 bits).
inb_p( ), inw_p( ), inl_p( )
Read 1, 2, or 4 consecutive bytes, respectively, from an I/O port, and then execute a "dummy" instruction to introduce a pause.
outb( ), outw( ), outl( )
Write 1, 2, or 4 consecutive bytes, respectively, to an I/O port.
outb_p( ), outw_p( ), outl_p( )
Write 1, 2, or 4 consecutive bytes, respectively, to an I/O port, and then execute a "dummy" instruction to introduce a pause.
insb( ), insw( ), insl( )
Read sequences of consecutive bytes in groups of 1, 2, or 4, respectively, from an I/O port. The length of the sequence is specified as a parameter of the functions.
outsb( ), outsw( ), outsl( )
Write sequences of consecutive bytes, in groups of 1, 2, or 4, respectively, to an I/O port.
While accessing I/O ports is simple, detecting which I/O ports have been assigned to I/O devices may not be easy, in particular for systems based on an ISA bus. Often a device driver must blindly write into some I/O port to probe the hardware device; if, however, this I/O port is already used by some other hardware device, a system crash could occur. To prevent such situations, the kernel keeps track of I/O ports assigned to each hardware device by means of "resources."
A resource represents a portion of some
entity that can be exclusively assigned to a device driver. In our
case, a resource represents a range of I/O port addresses. The
information relative to each resource is stored in a resource data structure, whose fields are
shown in Table
13-1. All resources of the same kind are inserted in a
tree-like data structure; for instance, all resources representing
I/O port address ranges are included in a tree rooted at the node
ioport_resource.
Table 13-1. The fields of the resource data structure
| Type | Field | Description |
|---|---|---|
| const char * | name | Description of owner of the resource |
| unsigned long | start | Start of the resource range |
| unsigned long | end | End of the resource range |
| unsigned long | flags | Various flags |
| struct resource * | parent | Pointer to parent in the resource tree |
| struct resource * | sibling | Pointer to a sibling in the resource tree |
| struct resource * | child | Pointer to first child in the resource tree |
The children of a node are collected in a list whose first
element is pointed to by the child field. The sibling field points to the next node in
the list.
Why use a tree? Well, consider, for instance, the I/O port
addresses used by an IDE hard disk interface—let's say from 0xf000 to 0xf00f. A resource with the start field set to 0xf000 and the end field set to 0xf00f is then included in the tree, and
the conventional name of the controller is stored in the name field. However, the IDE device driver
needs to remember another bit of information, namely that the
subrange from 0xf000 to 0xf007 is used for the master disk of the
IDE chain, while the subrange from 0xf008 to 0xf00f is used for the slave disk. To do
this, the device driver inserts two children below the resource
corresponding to the whole range from 0xf000 to 0xf00f, one child for each subrange of I/O
ports. As a general rule, each node of the tree must correspond to a
subrange of the range associated with the parent. The root of the
I/O port resource tree (ioport_resource) spans the whole I/O
address space (from port number 0 to 65535).
Each device driver may use the following three functions, passing to them the root node of the resource tree and the address of a resource data structure of interest:
request_resource( )
Assigns a given range to an I/O device.
allocate_resource( )
Finds an available range having a given size and alignment in the resource tree; if it exists, assigns the range to an I/O device (mainly used by drivers of PCI devices, which can be configured to use arbitrary port numbers and on-board memory addresses).
release_resource( )
Releases a given range previously assigned to an I/O device.
The kernel also defines some shortcuts to the above functions
that apply to I/O ports: request_region(
) assigns a given interval of I/O ports and release_region( ) releases a previously
assigned interval of I/O ports. The tree of all I/O addresses
currently assigned to I/O devices can be obtained from the /proc/ioports file.
An I/O interface is a hardware circuit inserted between a group of I/O ports and the corresponding device controller. It acts as an interpreter that translates the values in the I/O ports into commands and data for the device. In the opposite direction, it detects changes in the device state and correspondingly updates the I/O port that plays the role of status register. This circuit can also be connected through an IRQ line to a Programmable Interrupt Controller, so that it issues interrupt requests on behalf of the device.
There are two types of interfaces:
Custom I/O interfaces
Devoted to one specific hardware device. In some cases, the device controller is located in the same card [*] that contains the I/O interface. The devices attached to a custom I/O interface can be either internal devices (devices located inside the PC's cabinet) or external devices (devices located outside the PC's cabinet).
General-purpose I/O interfaces
Used to connect several different hardware devices. Devices attached to a general-purpose I/O interface are usually external devices.
Just to give an idea of how much variety is encompassed by custom I/O interfaces—thus by the devices currently installed in a PC—we'll list some of the most commonly found:
Keyboard interface
Connected to a keyboard controller that includes a dedicated microprocessor. This microprocessor decodes the combination of pressed keys, generates an interrupt, and puts the corresponding scan code in an input register.
Graphic interface
Packed together with the corresponding controller in a graphic card that has its own frame buffer, as well as a specialized processor and some code stored in a Read-Only Memory chip (ROM). The frame buffer is an on-board memory containing a description of the current screen contents.
Disk interface
Connected by a cable to the disk controller, which is usually integrated with the disk. For instance, the IDE interface is connected by a 40-wire flat conductor cable to an intelligent disk controller that can be found on the disk itself.
Bus mouse interface
Connected by a cable to the corresponding controller, which is included in the mouse.
Network interface
Packed together with the corresponding controller in a network card used to receive or transmit network packets. Although there are several widely adopted network standards, Ethernet (IEEE 802.3) is the most common.
Modern PCs include several general-purpose I/O interfaces, which connect a wide range of external devices. The most common interfaces are:
Parallel port
Traditionally used to connect printers, it can also be used to connect removable disks, scanners, backup units, and other computers. The data is transferred 1 byte (8 bits) at a time.
Serial port
Like the parallel port, but the data is transferred 1 bit at a time. It includes a Universal Asynchronous Receiver and Transmitter (UART) chip to string out the bytes to be sent into a sequence of bits and to reassemble the received bits into bytes. Because it is intrinsically slower than the parallel port, this interface is mainly used to connect external devices that do not operate at a high speed, such as modems, mice, and printers.
PCMCIA interface
Included mostly on portable computers. The external device, which has the shape of a credit card, can be inserted into and removed from a slot without rebooting the system. The most common PCMCIA devices are hard disks, modems, network cards, and RAM expansions.
SCSI (Small Computer System Interface) interface
A circuit that connects the main PC bus to a secondary bus called the SCSI bus. The SCSI-2 bus allows up to eight PCs and external devices—hard disks, scanners, CD-ROM writers, and so on—to be connected. Wide SCSI-2 and the SCSI-3 interfaces allow you to connect 16 devices or more if additional interfaces are present. The SCSI standard is the communication protocol used to connect devices via the SCSI bus.
Universal serial bus (USB)
A general-purpose I/O interface that operates at a high speed and may be used for the external devices traditionally connected to the parallel port, the serial port, and the SCSI interface.
A complex device may require a device controller to drive it. Essentially, the controller plays two important roles:
It interprets the high-level commands received from the I/O interface and forces the device to execute specific actions by sending proper sequences of electrical signals to it.
It converts and properly interprets the electrical signals received from the device and modifies (through the I/O interface) the value of the status register.
A typical device controller is the disk controller, which receives high-level commands such as "write this block of data" from the microprocessor (through the I/O interface) and converts them into low-level disk operations such as "position the disk head on the right track" and "write the data inside the track." Modern disk controllers are very sophisticated, because they can keep the disk data in on-board fast disk caches and can reorder the CPU's high-level requests to optimize for the actual disk geometry.
Simpler devices do not have a device controller; examples include the Programmable Interrupt Controller (see the section "Interrupts and Exceptions" in Chapter 4) and the Programmable Interval Timer (see the section "Programmable Interval Timer (PIT)" in Chapter 6).
Several hardware devices include their own memory, which is often called I/O shared memory . For instance, all recent graphic cards include tens of megabytes of RAM in the frame buffer, which is used to store the screen image to be displayed on the monitor. We will discuss I/O shared memory in the section "Accessing the I/O Shared Memory" later in this chapter.
Earlier versions of the Linux kernel offered a few basic functionalities to device driver developers: allocating dynamic memory, reserving a range of I/O addresses or an IRQ line, and activating an interrupt service routine in response to a device's interrupt. Older hardware devices, in fact, were cumbersome and difficult to program, and two different hardware devices had little in common even if they were hosted on the same bus. Thus, there was no point in trying to offer a unifying model to device driver developers.
Things are different now. Bus types such as PCI put strong demands on the internal design of the hardware devices; as a consequence, recent hardware devices, even of different classes, sport similar functionalities. Drivers for such devices should typically take care of:
Power management (handling of different voltage levels on the device's power line)
Plug and play (transparent allocation of resources when configuring the device)
Hot-plugging (support for insertion and removal of the device while the system is running)
Power management is performed globally by the kernel on every hardware device in the system. For instance, when a battery-powered computer enters the "standby" state, the kernel must force every hardware device (hard disks, graphics card, sound card, network card, bus controllers, and so on) into a low-power state. Thus, each driver of a device that can be put in the "standby" state must include a callback function that puts the hardware device in the low-power state. Moreover, the hardware devices must be put in the "standby" state in a precise order; otherwise, some devices could be left in the wrong power state. For instance, the kernel must put the hard disks in "standby" before their disk controller, because in the opposite case it would be impossible to send commands to the hard disks.
To implement these kinds of operations, Linux 2.6 provides some data structures and helper functions that offer a unifying view of all buses, devices, and device drivers in the system; this framework is called the device driver model .
The sysfs filesystem is a special filesystem similar to /proc that is usually mounted on the /sys directory. The /proc filesystem was the first special filesystem designed to allow User Mode applications to access kernel internal data structures. The sysfs filesystem has essentially the same objective, but it provides additional information on kernel data structures; furthermore, sysfs is organized in a more structured way than /proc. Most likely, both /proc and sysfs will continue to coexist in the near future.
A goal of the sysfs filesystem is to expose the hierarchical relationships among the components of the device driver model. The related top-level directories of this filesystem are:
block
The block devices, independently from the bus to which they are connected.
devices
All hardware devices recognized by the kernel, organized according to the bus to which they are connected.
bus
The buses in the system, which host the devices.
drivers
The device drivers registered in the kernel.
class
The types of devices in the system (audio cards, network cards, graphics cards, and so on); the same class may include devices hosted by different buses and driven by different drivers.
power
Files to handle the power states of some hardware devices.
firmware
Files to handle the firmware of some hardware devices.
Relationships between components of the device driver model are expressed in the sysfs filesystem as symbolic links between directories and files. For example, the /sys/block/sda/device file can be a symbolic link to a subdirectory nested in /sys/devices/pci0000:00 representing the SCSI controller connected to the PCI bus. Moreover, the /sys/block/sda/device/block file is a symbolic link to /sys/block/sda, stating that this PCI device is the controller of the SCSI disk.
The main role of regular files in the sysfs filesystem is to represent attributes of drivers and devices. For instance, the dev file in the /sys/block/hda directory contains the major and minor numbers of the master disk in the first IDE chain.
The core data structure of the device driver model is a generic data structure named kobject, which is inherently tied to the sysfs filesystem: each kobject corresponds to a directory in that filesystem.
Kobjects are embedded inside larger objects—the so-called "containers"—that describe the components of the device driver model.[*] The descriptors of buses, devices, and drivers are typical examples of containers; for instance, the descriptor of the first partition in the first IDE disk corresponds to the /sys/block/hda/hda1 directory.
Embedding a kobject inside a container allows the kernel to:
Keep a reference counter for the container
Maintain hierarchical lists or sets of containers (for instance, a sysfs directory associated with a block device includes a different subdirectory for each disk partition)
Provide a User Mode view for the attributes of the container
A kobject is represented by a kobject data structure, whose fields are
listed in Table
13-2.
Table 13-2. The fields of the kobject data structure
| Type | Field | Description |
|---|---|---|
| char * | k_name | Pointer to a string holding the name of the container |
| char [20] | name | String holding the name of the container, if it fits in 20 bytes |
| struct kref | kref | The reference counter for the container |
| struct list_head | entry | Pointers for the list in which the kobject is inserted |
| struct kobject * | parent | Pointer to the parent kobject, if any |
| struct kset * | kset | Pointer to the containing kset |
| struct kobj_type * | ktype | Pointer to the kobject type descriptor |
| struct dentry * | dentry | Pointer to the dentry of the sysfs file associated with the kobject |
The ktype field points to a
kobj_type object representing the
"type" of the kobject—essentially, the type of the container that
includes the kobject. The kobj_type data structure includes three
fields: a release method
(executed when the kobject is being freed), a sysfs_ops pointer to a table of
sysfs operations, and a list of default
attributes for the sysfs filesystem.
The kref field is a structure of type kref consisting of a single refcount
field. As the name implies, this field is the reference counter for
the kobject, but it may act also as the reference counter for the
container of the kobject. The kobject_get(
) and kobject_put( )
functions increase and decrease, respectively, the reference
counter; if the counter reaches the value zero, the resources used
by the kobject are released and the release method of the kobj_type object of the kobject is
executed. This method, which is usually defined only if the
container of the kobject was allocated dynamically, frees the
container itself.
The kobjects can be organized in a hierarchical tree by means
of ksets . A kset is a collection of kobjects of the same
type—that is, included in the same type of container. The fields of
the kset data structure are
listed in Table
13-3.
Table 13-3. The fields of the kset data structure
| Type | Field | Description |
|---|---|---|
| struct subsystem * | subsys | Pointer to the subsystem descriptor |
| struct kobj_type * | ktype | Pointer to the kobject type descriptor of the kset |
| struct list_head | list | Head of the list of kobjects included in the kset |
| struct kobject | kobj | Embedded kobject (see text) |
| struct kset_hotplug_ops * | hotplug_ops | Pointer to a table of callback functions for kobject filtering and hot-plugging |
The list field is the head
of the doubly linked circular list of kobjects included in the kset;
the ktype field points to the
same kobj_type descriptor shared
by all kobjects in the kset.
The kobj field is a kobject
embedded in the kset data
structure; the parent field of
the kobjects contained in the kset points to this embedded kobject.
Thus, a kset is a collection of kobjects, but it relies on a kobject
of higher level for reference counting and linking in the
hierarchical tree. This design choice is code-efficient and allows
the greatest flexibility. For instance, the kset_get( ) and kset_put( ) functions, which respectively increase and decrease the reference counter of the kset, simply invoke kobject_get( ) and kobject_put( ) on the embedded kobject, because the reference counter of a kset is merely the reference counter of the kobj kobject embedded in the kset. Moreover, thanks to the embedded
kobject, the kset data structure
can be embedded in a "container" object, exactly as for the kobject data structure. Finally, a kset
can be made a member of another kset: it suffices to insert the
embedded kobject in the higher-level kset.
Collections of ksets called subsystems
also exist. A subsystem may include ksets of
different types, and it is represented by a subsystem data structure having just two
fields:
kset
An embedded kset that stores the ksets included in the subsystem
rwsem
A read-write semaphore that protects all ksets and kobjects recursively included in the subsystem
Even the subsystem data
structure can be embedded in a larger "container" object; the
reference counter of the container is thus the reference counter of
the embedded subsystem—that is, the reference counter of the kobject
embedded in the kset embedded in the subsystem. The subsys_get( ) and subsys_put( ) functions respectively
increase and decrease this reference counter.
Figure 13-3 illustrates an example of the device driver model hierarchy. The bus subsystem includes a pci subsystem, which, in turn, includes a drivers kset. This kset contains a serial kobject—corresponding to the device driver for the serial port—having a single new-id attribute.
As a general rule, if you want a kobject, kset, or subsystem to appear in the sysfs subtree, you must first register it. The directory associated with a kobject always appears in the directory of the parent kobject. For instance, the directories of kobjects included in the same kset appear in the directory of the kset itself. Therefore, the structure of the sysfs subtree represents the hierarchical relationships between the various registered kobjects and, consequently, between the various container objects. Usually, the top-level directories of the sysfs filesystem are associated with the registered subsystems.
The kobject_register( )
function initializes a kobject and adds the corresponding directory
to the sysfs filesystem. Before invoking it,
the caller should set the kset
field in the kobject so that it points to the parent kset, if any.
The kobject_unregister( )
function removes a kobject's directory from the
sysfs filesystem. To make life easier for
kernel developers, Linux also offers the kset_register( ) and kset_unregister( ) functions, and the
subsystem_register( ) and
subsystem_unregister( )
functions, but they are essentially wrapper functions around
kobject_register( ) and kobject_unregister( ).
As stated before, many kobject directories include regular
files called attributes . The sysfs_create_file( ) function
receives as its parameters the addresses of a kobject and an
attribute descriptor, and creates the special file in the proper
directory. Other relationships between the objects
represented in the sysfs filesystem are
established by means of symbolic links: the sysfs_create_link() function creates a
symbolic link for a given kobject in a directory associated with
another kobject.
The device driver model is built upon a handful of basic data structures, which represent buses, devices, device drivers, etc. Let us examine them.
Each device in the device driver model is represented
by a device object, whose fields
are shown in Table
13-4.
Table 13-4. The fields of the device object
| Type | Field | Description |
|---|---|---|
| struct list_head | node | Pointers for the list of sibling devices |
| struct list_head | bus_list | Pointers for the list of devices on the same bus type |
| struct list_head | driver_list | Pointers for the driver's list of devices |
| struct list_head | children | Head of the list of children devices |
| struct device * | parent | Pointer to the parent device |
| struct kobject | kobj | Embedded kobject |
| char [] | bus_id | Device position on the hosting bus |
| struct bus_type * | bus | Pointer to the hosting bus |
| struct device_driver * | driver | Pointer to the controlling device driver |
| void * | driver_data | Pointer to private data for the driver |
| void * | platform_data | Pointer to private data for legacy device drivers |
| struct dev_pm_info | power | Power management information |
| unsigned long | detach_state | Power state to be entered when unloading the device driver |
| unsigned long long * | dma_mask | Pointer to the DMA mask of the device (see the later section "Direct Memory Access (DMA)") |
| unsigned long long | coherent_dma_mask | Mask for coherent DMA of the device |
| struct list_head | dma_pools | Head of a list of aggregate DMA buffers |
| struct dma_coherent_mem * | dma_mem | Pointer to a descriptor of the coherent DMA memory used by the device (see the later section "Direct Memory Access (DMA)") |
| void (*)(struct device *) | release | Callback function for releasing the device descriptor |
The device objects are
globally collected in the devices_subsys subsystem, which is
associated with the /sys/devices directory (see the earlier
section "Kobjects"). The
devices are organized hierarchically: a device is the "parent" of
some "children" devices if the children devices cannot work properly
without the parent device. For instance, in a PCI-based computer, a
bridge between the PCI bus and the USB bus is the parent device of
every device hosted on the USB bus. The parent field of the device object points to the descriptor of
the parent device, the children
field is the head of the list of children devices, and the node field stores the pointers to the
adjacent elements in the children list. The parenthood relationships
between the kobjects embedded in the device objects reflect also the device
hierarchy; thus, the structure of the directories below /sys/devices matches the physical
organization of the hardware devices.
Each driver keeps a list of device objects including all managed
devices; the driver_list field of
the device object stores the
pointers to the adjacent elements, while the driver field points to the descriptor of
the device driver. For each bus type, moreover, there is a list
including all devices that are hosted on the buses of the given
type; the bus_list field of the
device object stores the pointers
to the adjacent elements, while the bus field points to the bus type
descriptor.
A reference counter keeps track of the usage of the device object; it is included in the
kobj kobject embedded in the
descriptor. The counter is increased by invoking get_device( ), and it is decreased by
invoking put_device( ).
The device_register( )
function inserts a new device
object in the device driver model, and automatically creates a new
directory for it under /sys/devices . Conversely, the device_unregister( ) function removes a
device from the device driver model.
Usually, the device object
is statically embedded in a larger descriptor. For instance, PCI
devices are described by pci_dev
data structures; the dev field of
this structure is a device
object, while the other fields are specific to the PCI bus. The
device_register( ) and device_unregister( ) functions are
executed when the device is being registered or de-registered in the
PCI kernel layer.
Each driver in the device driver model is described by
a device_driver object, whose
fields are listed in Table 13-5.
Table 13-5. The fields of the device_driver object

| Type | Field | Description |
|---|---|---|
| char * | name | Name of the device driver |
| struct bus_type * | bus | Pointer to descriptor of the bus that hosts the supported devices |
| struct semaphore | unload_sem | Semaphore to forbid device driver unloading; it is released when the reference counter reaches zero |
| struct kobject | kobj | Embedded kobject |
| struct list_head | devices | Head of the list including all devices supported by the driver |
| struct module * | owner | Identifies the module that implements the device driver, if any (see Appendix B) |
| int (*)(struct device *) | probe | Method for probing a device (checking that it can be handled by the device driver) |
| int (*)(struct device *) | remove | Method invoked on a device when it is removed |
| void (*)(struct device *) | shutdown | Method invoked on a device when it is powered off (shut down) |
| int (*)(struct device *, pm_message_t, u32) | suspend | Method invoked on a device when it is put in low-power state |
| int (*)(struct device *, u32) | resume | Method invoked on a device when it is put back in the normal state (full power) |
The device_driver object
includes four methods for handling hot-plugging, plug and play, and
power management. The probe
method is invoked whenever a bus device driver discovers a device
that could possibly be handled by the driver; the corresponding
function should probe the hardware to perform further checks on the
device. The remove method is
invoked on a hot-pluggable device whenever it is removed; it is also
invoked on every device handled by the driver when the driver itself
is unloaded. The shutdown,
suspend, and resume methods are invoked on a device
when the kernel must change its power state.
The reference counter included in the kobj kobject embedded in the descriptor
keeps track of the usage of the device_driver object. The counter is
increased by invoking get_driver(
), and it is decreased by invoking put_driver( ).
The driver_register( )
function inserts a new device_driver object in the device driver
model, and automatically creates a new directory for it in the
sysfs filesystem. Conversely, the driver_unregister( ) function removes a
driver from the device driver model.
Usually, the device_driver
object is statically embedded in a larger descriptor. For instance,
PCI device drivers are described by pci_driver data structures; the driver field of this structure is a
device_driver object, while the
other fields are specific to the PCI bus.
Each bus type supported by the kernel is described by
a bus_type object, whose fields
are listed in Table
13-6.
Table 13-6. The fields of the bus_type object

| Type | Field | Description |
|---|---|---|
| char * | name | Name of the bus type |
| struct subsystem | subsys | Kobject subsystem associated with this bus type |
| struct kset | drivers | The set of kobjects of the drivers |
| struct kset | devices | The set of kobjects of the devices |
| struct bus_attribute * | bus_attrs | Pointer to the object including the bus attributes and the methods for exporting them to the sysfs filesystem |
| struct device_attribute * | dev_attrs | Pointer to the object including the device attributes and the methods for exporting them to the sysfs filesystem |
| struct driver_attribute * | drv_attrs | Pointer to the object including the device driver attributes and the methods for exporting them to the sysfs filesystem |
| int (*)(struct device *, struct device_driver *) | match | Method for checking whether a given driver supports a given device |
| int (*)(struct device *, char **, int, char *, int) | hotplug | Method invoked when a device is being registered |
| int (*)(struct device *, pm_message_t) | suspend | Method for saving the hardware context state and changing the power level of a device |
| int (*)(struct device *) | resume | Method for changing the power level and restoring the hardware context of a device |
Each bus_type object
includes an embedded subsystem; the subsystem stored in the bus_subsys variable collects all
subsystems embedded in the bus_type objects. The bus_subsys subsystem is associated with
the /sys/bus directory; thus,
for example, there exists a /sys/bus/pci
directory associated with the PCI bus type. The per-bus subsystem
typically includes only two ksets named drivers
and devices (corresponding to the drivers and devices fields of the bus_type object, respectively).
The drivers kset contains
the device_driver descriptors of
all device drivers pertaining to the bus type, while the devices kset contains the device descriptors of all devices of the
given bus type. Because the directories of the devices' kobjects
already appear in the sysfs filesystem under
/sys/devices, the devices
directory of the per-bus subsystem stores symbolic links pointing to
directories under /sys/devices.
The bus_for_each_drv( ) and
bus_for_each_dev( ) functions
iterate over the elements of the lists of drivers and devices,
respectively.
The match method is
executed when the kernel must check whether a given device can be
handled by a given driver. Even if each device's identifier has a
format specific to the bus that hosts the device, the function that
implements the method is usually simple, because it searches the
device's identifier in the driver's table of supported identifiers.
The hotplug method is executed
when a device is being registered in the device driver model; the
implementing function should add bus-specific information to be
passed as environment variables to a User Mode program that is
notified about the new available device (see the later section
"Device Driver
Registration"). Finally, the suspend and resume methods are executed when a device
on a bus of the given type must change its power state.
Each class is described by a class object. All class objects belong to
the class_subsys subsystem
associated with the /sys/class directory. Each
class object, moreover, includes
an embedded subsystem; thus, for example, there exists a
/sys/class/input directory associated with the
input class of the device driver model.
Each class object includes a list of class_device descriptors, each of which
represents a single logical device belonging to
the class. The class_device
structure includes a dev field
that points to a device
descriptor, thus a logical device always refers to a given device in
the device driver model. However, there can be several class_device descriptors that refer to the
same device. In fact, a hardware device might include several
different sub-devices, each of which requires a different User Mode
interface. For example, the sound card is a hardware device that
usually includes a DSP, a mixer, a game port interface, and so on;
each sub-device requires its own User Mode interface, thus it is
associated with its own directory in the sysfs
filesystem.
Device drivers in the same class are expected to offer the same functionalities to the User Mode applications; for instance, all device drivers of sound cards should offer a way to write sound samples to the DSP.
The classes of the device driver model are essentially aimed at
providing a standard method for exporting the interfaces of the
logical devices to User Mode applications. Each class_device descriptor embeds a kobject
having an attribute (special file) named dev. This attribute stores the major and
minor numbers of the device file needed to access the
corresponding logical device (see the next section).
As mentioned in Chapter
1, Unix-like operating systems are based on the notion of a file,
which is just an information container structured as a sequence of
bytes. According to this approach, I/O devices are treated as special
files called device files ; thus, the same system calls used to interact with
regular files on disk can be used to directly interact with I/O devices.
For example, the same write( )
system call may be used to write data into a regular file
or to send it to a printer by writing to the /dev/lp0 device file.
According to the characteristics of the underlying device drivers, device files can be of two types: block or character. The difference between the two classes of hardware devices is not so clear-cut. At least we can assume the following:
The data of a block device can be addressed randomly, and the time needed to transfer a data block is small and roughly the same, at least from the point of view of the human user. Typical examples of block devices are hard disks, floppy disks, CD-ROM drives, and DVD players.
The data of a character device either cannot be addressed randomly (consider, for instance, a sound card), or they can be addressed randomly, but the time required to access a random datum largely depends on its position inside the device (consider, for instance, a magnetic tape driver).
Network cards are a notable exception to this schema, because they are hardware devices that are not directly associated with device files.
Device files have been in use since the early versions of the Unix operating system. A device file is usually a real file stored in a filesystem. Its inode, however, doesn't need to include pointers to blocks of data on the disk (the file's data) because there are none. Instead, the inode must include an identifier of the hardware device corresponding to the character or block device file.
Traditionally, this identifier consists of the type of device file (character or block) and a pair of numbers. The first number, called the major number, identifies the device type. Traditionally, all device files that have the same major number and the same type share the same set of file operations, because they are handled by the same device driver. The second number, called the minor number, identifies a specific device among a group of devices that share the same major number. For instance, a group of disks managed by the same disk controller have the same major number and different minor numbers.
The mknod( ) system call is used to create device files. It receives
the name of the device file, its type, and the major and minor numbers
as its parameters. Device files are usually included in the /dev directory. Table 13-7 illustrates the
attributes of some device files. Notice that character and block devices
have independent numbering, so block device (3,0) is different from
character device (3,0).
Table 13-7. Examples of device files

| Name | Type | Major | Minor | Description |
|---|---|---|---|---|
| /dev/fd0 | block | 2 | 0 | Floppy disk |
| /dev/hda | block | 3 | 0 | First IDE disk |
| /dev/hda2 | block | 3 | 2 | Second primary partition of first IDE disk |
| /dev/hdb | block | 3 | 64 | Second IDE disk |
| /dev/hdb3 | block | 3 | 67 | Third primary partition of second IDE disk |
| /dev/ttyp0 | char | 3 | 0 | Terminal |
| /dev/console | char | 5 | 1 | Console |
| /dev/lp1 | char | 6 | 1 | Parallel printer |
| /dev/ttyS0 | char | 4 | 64 | First serial port |
| /dev/rtc | char | 10 | 135 | Real-time clock |
| /dev/null | char | 1 | 3 | Null device (black hole) |
Usually, a device file is associated with a hardware device (such as a hard disk—for instance, /dev/hda) or with some physical or logical portion of a hardware device (such as a disk partition—for instance, /dev/hda2). In some cases, however, a device file is not associated with any real hardware device, but represents a fictitious logical device. For instance, /dev/null is a device file corresponding to a "black hole;" all data written into it is simply discarded, and the file always appears empty.
As far as the kernel is concerned, the name of the device file is irrelevant. If you create a device file named /tmp/disk of type "block" with the major number 3 and minor number 0, it would be equivalent to the /dev/hda device file shown in the table. On the other hand, device filenames may be significant for some application programs. For example, a communication program might assume that the first serial port is associated with the /dev/ttyS0 device file. But most application programs can be configured to interact with arbitrarily named device files.
In traditional Unix systems (and in earlier versions of Linux), the major and minor numbers of the device files are 8 bits long. Thus, there could be at most 65,536 block device files and 65,536 character device files. You might expect they will suffice, but unfortunately they don't.
The real problem is that device files are traditionally allocated once and forever in the /dev directory; therefore, each logical device in the system should have an associated device file with a well-defined device number. The official registry of allocated device numbers and /dev directory nodes is stored in the Documentation/devices.txt file; the macros corresponding to the major numbers of the devices may also be found in the include/linux/major.h file.
Unfortunately, the number of different hardware devices is so large nowadays that almost all device numbers have already been allocated. The official registry of device numbers works well for the average Linux system; however, it may not be well suited for large-scale systems. Furthermore, high-end systems may use hundreds or thousands of disks of the same type, and an 8-bit minor number is not sufficient. For instance, the registry reserves device numbers for 16 SCSI disks having 15 partitions each; if a high-end system has more than 16 SCSI disks, the standard assignment of major and minor numbers has to be changed—a non trivial task that requires modifying the kernel source code and makes the system hard to maintain.
In order to solve this kind of problem, the size of the device
numbers has been increased in Linux 2.6: the major number is now
encoded in 12 bits, while the minor number is encoded in 20 bits. Both
numbers are usually kept in a single 32-bit variable of type dev_t; the MAJOR and MINOR macros extract the major and minor
numbers, respectively, from a dev_t
value, while the MKDEV macro
encodes the two device numbers in a dev_t value. For backward compatibility, the
kernel handles properly old device files encoded with 16-bit device
numbers.
The additional available device numbers are not being statically allocated in the official registry, because they should be used only when dealing with unusual demands for device numbers. Actually, today's preferred way to deal with device files is highly dynamic, both in the device number assignment and in the device file creation.
Each device driver specifies in the registration phase the range of device numbers that it is going to handle (see the later section "Device Driver Registration"). The driver can, however, require the allocation of an interval of device numbers without specifying the exact values: in this case, the kernel allocates a suitable range of numbers and assigns them to the driver.
Therefore, device drivers of new hardware devices no longer require an assignment in the official registry of device numbers; they can simply use whatever numbers are currently available in the system.
In this case, however, the device file cannot be created once and forever; it must be created right after the device driver initialization with the proper major and minor numbers. Thus, there must be a standard way to export the device numbers used by each driver to the User Mode applications. As we have seen in the earlier section "Components of the Device Driver Model," the device driver model provides an elegant solution: the major and minor numbers are stored in the dev attributes contained in the subdirectories of /sys/class.
The Linux kernel can create the device files dynamically: there is no need to fill the /dev directory with the device files of every conceivable hardware device, because the device files can be created "on demand." Thanks to the device driver model, the kernel 2.6 offers a very simple way to do so. A set of User Mode programs, collectively known as the udev toolset, must be installed in the system. At the system startup the /dev directory is emptied, then a udev program scans the subdirectories of /sys/class looking for the dev files. For each such file, which represents a combination of major and minor number for a logical device supported by the kernel, the program creates a corresponding device file in /dev. It also assigns device filenames and creates symbolic links according to a configuration file, in such a way to resemble the traditional naming scheme for Unix device files. Eventually, /dev is filled with the device files of all devices supported by the kernel on this system, and nothing else.
Often a device file is created after the system has been initialized. This happens either when a module containing a device driver for a still unsupported device is loaded, or when a hot-pluggable device—such as a USB peripheral—is plugged into the system. The udev toolset can automatically create the corresponding device file, because the device driver model supports device hotplugging. Whenever a new device is discovered, the kernel spawns a new process that executes the User Mode /sbin/hotplug shell script,[*] passing to it any useful information on the discovered device as environment variables. The User Mode script usually reads a configuration file and takes care of any operation required to complete the initialization of the new device. If udev is installed, the script also creates the proper device file in the /dev directory.
Device files live in the system directory tree but are intrinsically different from regular files and directories. When a process accesses a regular file, it is accessing some data blocks in a disk partition through a filesystem; when a process accesses a device file, it is just driving a hardware device. For instance, a process might access a device file to read the room temperature from a digital thermometer connected to the computer. It is the VFS's responsibility to hide the differences between device files and regular files from application programs.
To do this, the VFS changes the default file operations of a device file when it is opened; as a result, each system call on the device file is translated to an invocation of a device-related function instead of the corresponding function of the hosting filesystem. The device-related function acts on the hardware device to perform the operation requested by the process.[†]
Let's suppose that a process executes an open( ) system call on a device file (either of type block or
character). The operations performed by the system call have already
been described in the section "The open( ) System Call"
in Chapter 12. Essentially,
the corresponding service routine resolves the pathname to the device
file and sets up the corresponding inode object, dentry object, and
file object.
The inode object is initialized by reading the corresponding
inode on disk through a suitable function of the filesystem (usually
ext2_read_inode( ) or ext3_read_inode( ); see Chapter 18). When this function
determines that the disk inode is relative to a device file, it
invokes init_special_inode( ),
which initializes the i_rdev field
of the inode object to the major and minor numbers of the device file,
and sets the i_fop field of the
inode object to the address of either the def_blk_fops or the def_chr_fops file operation table, according
to the type of device file. The service routine of the open( ) system call also invokes the
dentry_open( ) function, which
allocates a new file object and sets its f_op field to the address stored in i_fop—that is, to the address of def_blk_fops or def_chr_fops once again. Thanks to these two
tables, every system call issued on a device file will activate a
device driver's function rather than a function of the underlying
filesystem.
[*] The pathname of the User Mode program invoked on hot-plugging events can be changed by writing into the /proc/sys/kernel/hotplug file.
[†] Notice that, thanks to the name-resolving mechanism explained in the section "Pathname Lookup" in Chapter 12, symbolic links to device files work just like device files.
A device driver is the set of kernel routines that makes a hardware device respond to the programming interface defined by the canonical set of VFS functions (open, read, lseek, ioctl, and so forth) that control a device. The actual implementation of all these functions is delegated to the device driver. Because each device has a different I/O controller, and thus different commands and different state information, most I/O devices have their own drivers.
There are many types of device drivers . They mainly differ in the level of support that they offer to the User Mode applications, as well as in their buffering strategies for the data collected from the hardware devices. Because these choices greatly influence the internal structure of a device driver, we discuss them in the sections "Direct Memory Access (DMA)" and "Buffering Strategies for Character Devices."
A device driver does not consist only of the functions that implement the device file operations. Before using a device driver, several activities must have taken place. We'll examine them in the following sections.
We know that each system call issued on a device file is
translated by the kernel into an invocation of a suitable function of
a corresponding device driver. To achieve this, a device driver must
register itself. In other words,
registering a device driver means allocating a new device_driver descriptor, inserting it in
the data structures of the device driver model (see the earlier
section "Components of
the Device Driver Model"), and linking it to the corresponding
device file(s). Accesses to device files whose corresponding drivers
have not been previously registered return the error code -ENODEV.
If a device driver is statically compiled in the kernel, its registration is performed during the kernel initialization phase. Conversely, if a device driver is compiled as a kernel module (see Appendix B), its registration is performed when the module is loaded. In the latter case, the device driver can also unregister itself when the module is unloaded.
Let us consider, for instance, a generic PCI device. To properly
handle it, its device driver must allocate a descriptor of type
pci_driver, which is used by the
PCI kernel layer to handle the device. After having initialized some
fields of this descriptor, the device driver invokes the pci_register_driver( ) function. Actually,
the pci_driver descriptor includes
an embedded device_driver
descriptor (see the earlier section "Components of the Device Driver
Model"); the pci_register_driver(
) simply initializes the fields of the embedded driver
descriptor and invokes driver_register(
) to insert the driver in the data structures of the device
driver model.
When a device driver is being registered, the kernel looks for
unsupported hardware devices that could be possibly handled by the
driver. To do this, it relies on the match method of the relevant bus_type bus type descriptor, and on the
probe method of the device_driver object. If a hardware device
that can be handled by the driver is discovered, the kernel allocates
a device object and invokes
device_register( ) to insert the
device in the device driver model.
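The match-and-probe sequence just described can be sketched in miniature. The following user-space toy model (all names are invented for the illustration, not the kernel's actual API) shows how a bus-level match callback pairs a newly registered driver with unbound devices, and how the driver's probe callback confirms each binding:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_device { const char *id; int bound; };
struct toy_driver {
    const char *id_match;
    int (*probe)(struct toy_device *);
};

/* Analogue of the bus_type match method: does this driver claim this device? */
static int toy_match(const struct toy_driver *drv, const struct toy_device *dev)
{
    return strcmp(drv->id_match, dev->id) == 0;
}

/* A trivially successful probe, for the demonstration. */
static int toy_probe_ok(struct toy_device *dev)
{
    (void)dev;
    return 0;
}

/* Registration-time scan: bind the driver to every matching, unbound
 * device whose probe succeeds; returns how many devices were bound.
 * In the kernel, a successful probe is followed by inserting the
 * device object into the device driver model. */
static int toy_register_driver(struct toy_driver *drv,
                               struct toy_device *devs, size_t n)
{
    int bound = 0;
    for (size_t i = 0; i < n; i++) {
        if (devs[i].bound || !toy_match(drv, &devs[i]))
            continue;
        if (drv->probe(&devs[i]) == 0) {
            devs[i].bound = 1;
            bound++;
        }
    }
    return bound;
}
```

The real kernel iterates over the devices attached to the driver's bus rather than over a flat array, but the match-then-probe ordering is the same.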
Registering a device driver and initializing it are two different things. A device driver is registered as soon as possible, so User Mode applications can use it through the corresponding device files. In contrast, a device driver is initialized at the last possible moment. In fact, initializing a driver means allocating precious resources of the system, which are therefore not available to other drivers.
We already have seen an example in the section "I/O Interrupt Handling" in Chapter 4: the assignment of IRQs to devices is usually made dynamically, right before using them, because several devices may share the same IRQ line. Other resources that can be allocated at the last possible moment are page frames for DMA transfer buffers and the DMA channel itself (for old non-PCI devices such as the floppy disk driver).
To make sure the resources are obtained when needed but are not requested in a redundant manner when they have already been granted, device drivers usually adopt the following schema:
A usage counter keeps track of the number of processes that
are currently accessing the device file. The counter is increased
in the open method of the
device file and decreased in the release method.[*]
The open method checks
the value of the usage counter before the increment. If the
counter is zero, the device driver must allocate the resources and
enable interrupts and DMA on the hardware device.
The release method checks
the value of the usage counter after the decrement. If the counter
is zero, no more processes are using the hardware device. If so,
the method disables interrupts and DMA on the I/O controller, and
then releases the allocated resources.
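The counter schema above can be condensed into a few lines. This is a minimal user-space sketch with plain integers standing in for the real resource allocation and IRQ/DMA enabling; a real driver would also need mutual exclusion around the counter:

```c
#include <assert.h>

static int usage_count;       /* processes currently using the device */
static int resources_held;    /* 1 while IRQ/DMA resources are allocated */

/* open method: the first opener allocates the resources. */
static void foo_open(void)
{
    if (usage_count == 0)
        resources_held = 1;   /* allocate IRQ, DMA buffers, ... */
    usage_count++;
}

/* release method: the last closer frees them. */
static void foo_release(void)
{
    usage_count--;
    if (usage_count == 0)
        resources_held = 0;   /* disable IRQ/DMA, free the resources */
}
```

Note that the check precedes the increment in `foo_open( )` and follows the decrement in `foo_release( )`, exactly as the schema prescribes, so intermediate openers and closers never touch the resources.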
The duration of an I/O operation is often unpredictable. It can depend on mechanical considerations (the current position of a disk head with respect to the block to be transferred), on truly random events (when a data packet arrives on the network card), or on human factors (when a user presses a key on the keyboard or when she notices that a paper jam occurred in the printer). In any case, the device driver that started an I/O operation must rely on a monitoring technique that signals either the termination of the I/O operation or a time-out.
In the case of a terminated operation, the device driver reads the status register of the I/O interface to determine whether the I/O operation was carried out successfully. In the case of a time-out, the driver knows that something went wrong, because the maximum time interval allowed to complete the operation elapsed and nothing happened.
The two techniques available to monitor the end of an I/O operation are called the polling mode and the interrupt mode.
According to this technique, the CPU checks (polls) the device's status register repeatedly until its value signals that the I/O operation has been completed. We have already encountered a technique based on polling in the section "Spin Locks" in Chapter 5: when a processor tries to acquire a busy spin lock, it repeatedly polls the variable until its value becomes 0. However, polling applied to I/O operations is usually more elaborate, because the driver must also remember to check for possible time-outs. A simple example of polling looks like the following:
for (;;) {
    if (read_status(device) & DEVICE_END_OPERATION) break;
    if (--count == 0) break;
}
The count variable, which
was initialized before entering the loop, is decreased at each
iteration, and thus can be used to implement a rough time-out
mechanism. Alternatively, a more precise time-out mechanism could be
implemented by reading the value of the tick counter jiffies at each iteration (see the section
"Updating the Time and
Date" in Chapter 6)
and comparing it with the old value read before starting the wait
loop.
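To make the polling fragment concrete, here is a self-contained simulation in which a mock status register reports completion only after a few reads; the count-based time-out works exactly as in the fragment above. The read_status_sim and poll_device names are invented for the illustration:

```c
#include <assert.h>

#define DEVICE_END_OPERATION 0x80

/* Simulated status register: reports "operation complete" only after
 * a fixed number of reads, mimicking a slow device. */
static int reads_until_done;

static unsigned char read_status_sim(void)
{
    if (reads_until_done > 0) {
        reads_until_done--;
        return 0;                    /* still busy */
    }
    return DEVICE_END_OPERATION;     /* done */
}

/* Polling loop with a count-based time-out, as in the fragment above.
 * Returns 0 on completion, -1 on time-out. */
static int poll_device(int count)
{
    for (;;) {
        if (read_status_sim() & DEVICE_END_OPERATION)
            return 0;
        if (--count == 0)
            return -1;
    }
}
```

With `reads_until_done` smaller than the count, the loop terminates normally; with a slower "device" the counter expires first and the caller sees the time-out.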
If the time required to complete the I/O operation is
relatively high, say in the order of milliseconds, this schema
becomes inefficient because the CPU wastes precious machine cycles
while waiting for the I/O operation to complete. In such cases, it
is preferable to voluntarily relinquish the CPU after each polling
operation by inserting an invocation of the schedule( ) function inside the
loop.
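The more precise jiffies-based alternative mentioned above hinges on a wrap-safe comparison of tick counts, in the spirit of the kernel's time_after( ) macro. This stand-alone sketch (time_after_sim is an invented name; the real macro lives in the kernel headers) uses fixed-width types to make the wraparound behavior explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "is tick count a after tick count b?" comparison.
 * Computing the difference in unsigned arithmetic and then
 * interpreting it as signed handles counter wraparound correctly,
 * as long as the two values are less than half the range apart. */
static int time_after_sim(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}
```

A driver-style time-out would record the tick count before the wait loop and, at each iteration, compare the current count against that start value plus the allowed interval.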
Interrupt mode can be used only if the I/O controller is capable of signaling, via an IRQ line, the end of an I/O operation.
We'll show how interrupt mode works on a simple case. Let's suppose we want to
implement a driver for a simple input character device. When the
user issues a read( ) system call on the corresponding device file, an
input command is sent to the device's control register. After an
unpredictably long time interval, the device puts a single byte of
data in its input register. The device driver then returns this byte
as the result of the read( )
system call.
This is a typical case in which it is preferable to implement the driver using the interrupt mode. Essentially, the driver includes two functions:
The foo_read( )
function that implements the read method of the file object.
The foo_interrupt( )
function that handles the interrupt.
The foo_read( ) function is
triggered whenever the user reads the device file:
ssize_t foo_read(struct file *filp, char *buf, size_t count,
                 loff_t *ppos)
{
    foo_dev_t * foo_dev = filp->private_data;
    if (down_interruptible(&foo_dev->sem))
        return -ERESTARTSYS;
    foo_dev->intr = 0;
    outb(DEV_FOO_READ, DEV_FOO_CONTROL_PORT);
    wait_event_interruptible(foo_dev->wait, (foo_dev->intr == 1));
    if (put_user(foo_dev->data, buf))
        return -EFAULT;
    up(&foo_dev->sem);
    return 1;
}
The device driver relies on a custom descriptor of type
foo_dev_t; it includes a
semaphore sem that protects the
hardware device from concurrent accesses, a wait queue wait, a flag intr that is set when the device issues an
interrupt, and a single-byte buffer data that is written by the interrupt
handler and read by the read
method. In general, all I/O drivers that use interrupts rely on data
structures accessed by both the interrupt handler and the read and write methods. The address of the foo_dev_t descriptor is usually stored in
the private_data field of the
device file's file object or in a global variable.
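A possible layout for the descriptor just described can be sketched in user-space C with stand-in field types (in the kernel, sem would be a struct semaphore and wait a wait_queue_head_t), together with the effect the interrupt handler has on it:

```c
#include <assert.h>

/* Hypothetical layout of the foo_dev_t descriptor described above.
 * The first two field types are user-space stand-ins so the sketch
 * compiles outside the kernel. */
typedef struct {
    int sem;             /* stand-in for struct semaphore */
    int wait;            /* stand-in for wait_queue_head_t */
    int intr;            /* set by the interrupt handler */
    unsigned char data;  /* byte written by the handler, read by read() */
} foo_dev_t;

/* What the interrupt handler does to the descriptor: store the byte
 * (inb(DEV_FOO_DATA_PORT) in the real driver) and raise the flag the
 * read method is waiting on. */
static void simulate_interrupt(foo_dev_t *dev, unsigned char byte)
{
    dev->data = byte;
    dev->intr = 1;   /* the real handler then calls wake_up_interruptible */
}
```

After the simulated interrupt, the condition checked by `wait_event_interruptible` in `foo_read( )` becomes true and the sleeping process can resume.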
The main operations of the foo_read(
) function are the following:
Acquires the foo_dev->sem semaphore, thus
ensuring that no other process is accessing the device.
Clears the intr
flag.
Issues the read command to the I/O device.
Executes wait_event_interruptible to suspend
the process until the intr
flag becomes 1. This macro is described in the section "Wait queues" in
Chapter 3.
After some time, our device issues an interrupt to signal that
the I/O operation is completed and that the data is ready in the
proper DEV_FOO_DATA_PORT data
port. The interrupt handler sets the intr flag and wakes the process. When the
scheduler decides to reexecute the process, the second part of
foo_read( ) is executed and does
the following:
Copies the character ready in the foo_dev->data variable into the
user address space.
Terminates after releasing the foo_dev->sem semaphore.
For simplicity, we didn't include any time-out control. In general, time-out control is implemented through static or dynamic timers (see Chapter 6); the timer must be set to the right time before starting the I/O operation and removed when the operation terminates.
Let's now look at the code of the foo_interrupt( ) function:
irqreturn_t foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    foo->data = inb(DEV_FOO_DATA_PORT);
    foo->intr = 1;
    wake_up_interruptible(&foo->wait);
    return 1;
}
The interrupt handler reads the character from the input
register of the device and stores it in the data field of the foo_dev_t descriptor of the device driver
pointed to by the foo global
variable. It then sets the intr
flag and invokes wake_up_interruptible(
) to wake the process blocked in the foo->wait wait queue.
Notice that none of the three parameters are used by our interrupt handler. This is a rather common case.
Depending on the device and on the bus type, I/O shared memory in the PC's architecture may be mapped within different physical address ranges. Typically:
The I/O shared memory is usually mapped into the 16-bit
physical addresses ranging from 0xa0000 to 0xfffff; this gives rise to the "hole"
between 640 KB and 1 MB mentioned in the section "Physical Memory
Layout" in Chapter
2.
The I/O shared memory is mapped into 32-bit physical addresses near the 4 GB boundary. This kind of device is much simpler to handle.
A few years ago, Intel introduced the Accelerated Graphics Port (AGP) standard, which is an enhancement of PCI for high-performance graphic cards. Besides having its own I/O shared memory, this kind of card is capable of directly addressing portions of the motherboard's RAM by means of a special hardware circuit named Graphics Address Remapping Table (GART). The GART circuitry enables AGP cards to sustain much higher data transfer rates than older PCI cards. From the kernel's point of view, however, it doesn't really matter where the physical memory is located, and GART-mapped memory is handled like the other kinds of I/O shared memory.
How does a device driver access an I/O shared memory location? Let's start with the PC's architecture, which is relatively simple to handle, and then extend the discussion to other architectures.
Remember that kernel programs act on linear addresses, so the
I/O shared memory locations must be expressed as addresses greater
than PAGE_OFFSET. In the following
discussion, we assume that PAGE_OFFSET is equal to 0xc0000000—that is, that the kernel linear
addresses are in the fourth gigabyte.
Device drivers must translate I/O physical addresses of I/O
shared memory locations into linear addresses in kernel space. In the
PC architecture, this can be achieved simply by ORing the 32-bit
physical address with the 0xc0000000 constant. For instance, suppose
the kernel needs to store the value in the I/O location at physical
address 0x000b0fe4 in t1 and the value in the I/O location at
physical address 0xfc000000 in
t2. One might think that the
following statements could do the job:
t1 = *((unsigned char *)(0xc00b0fe4));
t2 = *((unsigned char *)(0xfc000000));
During the initialization phase, the kernel maps the available
RAM's physical addresses into the initial portion of the fourth
gigabyte of the linear address space. Therefore, the Paging Unit maps
the 0xc00b0fe4 linear address
appearing in the first statement back to the original I/O physical
address 0x000b0fe4, which falls
inside the "ISA hole" between 640 KB and 1 MB (see the section "Paging in Linux" in Chapter 2). This works
fine.
There is a problem, however, for the second statement, because
the I/O physical address is greater than the last physical address of
the system RAM. Therefore, the 0xfc000000 linear address does not
correspond to the 0xfc000000
physical address. In such cases, the kernel Page Tables must be
modified to include a linear address that maps the I/O physical
address. This can be done by invoking the ioremap( ) or ioremap_nocache( ) functions. The first
function, which is similar to vmalloc(
), invokes get_vm_area( )
to create a new vm_struct
descriptor (see the section "Descriptors of Noncontiguous
Memory Areas" in Chapter
8) for a linear address interval that has the size of the
required I/O shared memory area. The functions then update the
corresponding Page Table entries of the canonical kernel Page Tables
appropriately. The ioremap_nocache(
) function differs from ioremap(
) in that it also disables the hardware cache when
referencing the remapped linear addresses.
The correct form for the second statement might therefore look like:
io_mem = ioremap(0xfb000000, 0x200000);
t2 = *((unsigned char *)(io_mem + 0x100000));
The first statement creates a new 2 MB linear address interval,
which maps physical addresses starting from 0xfb000000; the second one reads the memory
location that has the 0xfc000000
address. To remove the mapping later, the device driver must use the
iounmap( ) function.
On some architectures other than the PC, I/O shared memory cannot be accessed by simply dereferencing the linear address pointing to the physical memory location. Therefore, Linux defines the following architecture-dependent functions, which should be used when accessing I/O shared memory:
readb( ), readw( ), readl( )
Reads 1, 2, or 4 bytes, respectively, from an I/O shared memory location
writeb( ), writew( ), writel( )
Writes 1, 2, or 4 bytes, respectively, into an I/O shared memory location
memcpy_fromio( ), memcpy_toio( )
Copies a block of data from an I/O shared memory location to dynamic memory and vice versa
memset_io( )
Fills an I/O shared memory area with a fixed value
The recommended way to access the 0xfc000000 I/O location is thus:
io_mem = ioremap(0xfb000000, 0x200000);
t2 = readb(io_mem + 0x100000);
Thanks to these functions, all dependencies on platform-specific ways of accessing the I/O shared memory can be hidden.
In the original PC architecture, the CPU is the only bus master of the system, that is, the only hardware device that drives the address/data bus in order to fetch and store values in the RAM's locations. With more modern bus architectures such as PCI, each peripheral can act as bus master, if provided with the proper circuitry. Thus, nowadays all PCs include auxiliary DMA circuits , which can transfer data between the RAM and an I/O device. Once activated by the CPU, the DMA is able to continue the data transfer on its own; when the data transfer is completed, the DMA issues an interrupt request. The conflicts that occur when CPUs and DMA circuits need to access the same memory location at the same time are resolved by a hardware circuit called a memory arbiter (see the section "Atomic Operations" in Chapter 5).
The DMA is mostly used by disk drivers and other devices that transfer a large number of bytes at once. Because setup time for the DMA is relatively high, it is more efficient to directly use the CPU for the data transfer when the number of bytes is small.
The first DMA circuits for the old ISA buses were complex, hard to program, and limited to the lower 16 MB of physical memory. More recent DMA circuits for the PCI and SCSI buses rely on dedicated hardware circuits in the buses and make life easier for device driver developers.
A device driver can use the DMA in two different ways called synchronous DMA and asynchronous DMA. In the first case, the data transfers are triggered by processes; in the second case the data transfers are triggered by hardware devices.
An example of synchronous DMA is a sound card that is playing a sound track. A User Mode application writes the sound data (called samples) on a device file associated with the digital signal processor (DSP) of the sound card. The device driver of the sound card accumulates these samples in a kernel buffer. At the same time, the device driver instructs the sound card to copy the samples from the kernel buffer to the DSP with a well-defined timing. When the sound card finishes the data transfer, it raises an interrupt, and the device driver checks whether the kernel buffer still contains samples yet to be played; if so, the driver activates another DMA data transfer.
An example of asynchronous DMA is a network card that is receiving a frame (data packet) from a LAN. The peripheral stores the frame in its I/O shared memory, then raises an interrupt. The device driver of the network card acknowledges the interrupt, then instructs the peripheral to copy the frame from the I/O shared memory into a kernel buffer. When the data transfer completes, the network card raises another interrupt, and the device driver notifies the upper kernel layer about the new frame.
When designing a driver for a device that makes use of DMA, the developer should write code that is both architecture-independent and, as far as DMA is concerned, bus-independent. This goal is now feasible thanks to the rich set of DMA helper functions provided by the kernel. These helper functions hide the differences in the DMA mechanisms of the various hardware architectures.
There are two subsets of DMA helper functions: an older subset provides architecture-independent functions for PCI devices; a more recent subset ensures both bus and architecture independence. We'll now examine some of these functions while pointing out some hardware peculiarities of DMAs.
Every DMA transfer involves (at least) one memory buffer, which contains the data to be read or written by the hardware device. In general, before activating the transfer, the device driver must ensure that the DMA circuit can directly access the RAM locations.
Until now we have distinguished three kinds of memory addresses: logical and linear addresses, which are used internally by the CPU, and physical addresses, which are the memory addresses used by the CPU to physically drive the data bus. However, there is a fourth kind of memory address: the so-called bus address. It corresponds to the memory addresses used by all hardware devices except the CPU to drive the data bus.
Why should the kernel be concerned at all about bus addresses ? Well, in a DMA operation, the data transfer takes place without CPU intervention; the data bus is driven directly by the I/O device and the DMA circuit. Therefore, when the kernel sets up a DMA operation, it must write the bus address of the memory buffer involved in the proper I/O ports of the DMA or I/O device.
In the 80 × 86 architecture, bus addresses coincide with physical addresses. However, other architectures such as Sun's SPARC and Hewlett-Packard's Alpha include a hardware circuit called the I/O Memory Management Unit (IO-MMU), analogous to the paging unit of the microprocessor, which maps physical addresses into bus addresses. All I/O drivers that make use of DMAs must properly set up the IO-MMU before starting the data transfer.
Different buses have different bus address sizes. For
instance, bus addresses for ISA are 24 bits long, thus in the 80 ×
86 architecture DMA transfers can be done only on the lower 16 MB of
physical memory—that's why the memory for the buffer used by such
DMA has to be allocated in the ZONE_DMA memory zone with the GFP_DMA flag. The original PCI standard
defines bus addresses of 32 bits; however, some PCI hardware devices
have been originally designed for the ISA bus, thus they still
cannot access RAM locations above physical address 0x00ffffff. The recent PCI-X standard uses
64-bit bus addresses and allows DMA circuits to address directly the
high memory.
In Linux, the dma_addr_t
type represents a generic bus address. In the 80 × 86 architecture
dma_addr_t corresponds to a
32-bit integer, unless the kernel supports PAE (see the section
"The Physical Address
Extension (PAE) Paging Mechanism" in Chapter 2), in which case
dma_addr_t corresponds to a
64-bit integer.
The pci_set_dma_mask( ) and
dma_set_mask( ) helper functions
check whether the bus accepts a given size for the bus addresses
(mask) and, if so, notify the bus layer that the given peripheral
will use that size for its bus addresses.
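What a DMA mask expresses can be illustrated with a toy check: a buffer is usable for DMA only if all of its addresses fit under the mask. This is not the kernel's dma_set_mask( ) implementation, just a sketch of the constraint, using the 24-bit ISA and 32-bit PCI address widths mentioned above:

```c
#include <assert.h>
#include <stdint.h>

#define ISA_DMA_MASK  0x00ffffffULL   /* 24-bit ISA bus addresses */
#define PCI_DMA_MASK  0xffffffffULL   /* 32-bit PCI bus addresses */

/* A buffer of `len` bytes starting at bus address `buf_start` is
 * reachable by the device only if its last byte still fits under
 * the device's address mask. */
static int fits_dma_mask(uint64_t buf_start, uint64_t len, uint64_t mask)
{
    return buf_start + len - 1 <= mask;
}
```

This is why buffers for 24-bit ISA DMA must come from the ZONE_DMA zone below 16 MB, while a 32-bit PCI device can reach any address in the first 4 GB.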
The system architecture does not necessarily offer a coherency protocol between the hardware cache and the DMA circuits at the hardware level, so the DMA helper functions must take into consideration the hardware cache when implementing DMA mapping operations. To see why, suppose that the device driver fills the memory buffer with some data, then immediately instructs the hardware device to read that data with a DMA transfer. If the DMA accesses the physical RAM locations but the corresponding hardware cache lines have not yet been written to RAM, then the hardware device fetches the old values of the memory buffer.
Device driver developers may handle DMA buffers in two different ways by making use of two different classes of helper functions. Using Linux terminology, the developer chooses between two different DMA mapping types :
Coherent DMA mapping
When using this mapping, the kernel ensures that there will be no cache coherency problems between the memory and the hardware device; this means that every write operation performed by the CPU on a RAM location is immediately visible to the hardware device, and vice versa. This type of mapping is also called "synchronous" or "consistent."
Streaming DMA mapping
When using this mapping, the device driver must take care of cache coherency problems by using the proper synchronization helper functions. This type of mapping is also called "asynchronous" or "non-coherent."
In the 80 × 86 architecture there are never cache coherency problems when using the DMA, because the hardware devices themselves take care of "snooping" the accesses to the hardware caches. Therefore, a driver for a hardware device designed specifically for the 80 × 86 architecture may choose either one of the two DMA mapping types: they are essentially equivalent. On the other hand, in many architectures—such as MIPS, SPARC, and some models of PowerPC—hardware devices do not always snoop in the hardware caches, so cache coherency problems arise. In general, choosing the proper DMA mapping type for an architecture-independent driver is not trivial.
As a general rule, if the buffer is accessed in unpredictable ways by the CPU and the DMA processor, coherent DMA mapping is mandatory (for instance, buffers for SCSI adapters' command data structures). In other cases, streaming DMA mapping is preferable, because in some architectures handling the coherent DMA mapping is cumbersome and may lead to lower system performance.
Usually, the device driver allocates the memory buffer
and establishes the coherent DMA mapping in the initialization
phase; it releases the mapping and the buffer when it is unloaded.
To allocate a memory buffer and to establish a coherent DMA mapping,
the kernel provides the architecture-dependent pci_alloc_consistent( ) and dma_alloc_coherent( ) functions. They both
return the linear address and the bus address of the new buffer. In
the 80 × 86 architecture, they return the linear address and the
physical address of the new buffer. To release the mapping and the
buffer, the kernel provides the pci_free_consistent( ) and the dma_free_coherent( ) functions.
Memory buffers for streaming DMA mappings are usually mapped just before the transfer and unmapped thereafter. It is also possible to keep the same mapping among several DMA transfers, but in this case the device driver developer must be aware of the hardware cache lying between the memory and the peripheral.
To set up a streaming DMA transfer, the driver must first
dynamically allocate the memory buffer by means of the zoned page
frame allocator (see the section "The Zoned Page Frame
Allocator" in Chapter
8) or the generic memory allocator (see the section "General Purpose
Objects" in Chapter
8). Then, the driver must establish the streaming DMA
mapping by invoking either the pci_map_single( ) or the dma_map_single( ) function, which receives
as its parameter the linear address of the buffer and returns its
bus address. To release the mapping, the driver invokes the
corresponding pci_unmap_single( )
or dma_unmap_single( )
function.
To avoid cache coherency problems, right before starting a DMA
transfer from the RAM to the device, the driver should invoke
pci_dma_sync_single_for_device( )
or dma_sync_single_for_device( ),
which flush, if necessary, the cache lines corresponding to the DMA
buffer. Similarly, a device driver should not access a memory buffer
right after the end of a DMA transfer from the device to the RAM:
instead, before reading the buffer, the driver should invoke
pci_dma_sync_single_for_cpu( ) or
dma_sync_single_for_cpu( ), which
invalidate, if necessary, the corresponding hardware cache lines. In
the 80 × 86 architecture, these functions do almost nothing, because
the coherency between hardware caches and DMAs is maintained by the
hardware.
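A streaming transmit path combining the mapping and synchronization steps might look like the following kernel-module fragment (strictly illustrative: the `transmit( )` name is ours, and on most architectures dma_map_single( ) already syncs the buffer for the device, so the explicit call matters mainly when the CPU writes the buffer after mapping it):

```c
/* Hypothetical streaming DMA transmit: map, sync, start DMA, unmap. */
#include <linux/dma-mapping.h>

static void transmit(struct device *dev, void *buf, size_t len)
{
    dma_addr_t bus_addr;

    bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

    /* flush, if necessary, the cache lines covering the buffer */
    dma_sync_single_for_device(dev, bus_addr, len, DMA_TO_DEVICE);

    /* ... program the device with bus_addr and wait for completion ... */

    dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
}
```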
Even buffers in high memory (see the section "Kernel Mappings of High-Memory
Page Frames" in Chapter
8) can be used for DMA transfers; the developer uses pci_map_page( )—or dma_map_page( )—passing to it the
descriptor address of the page including the buffer and the offset
of the buffer inside the page. Correspondingly, to release the
mapping of the high memory buffer, the developer uses pci_unmap_page( ) or dma_unmap_page( ).
The Linux kernel does not fully support all possible existing I/O devices. Generally speaking, in fact, there are three possible kinds of support for a hardware device:
The application program interacts directly with the
device's I/O ports by issuing suitable in and out assembly language instructions.
The kernel does not recognize the hardware device, but does recognize its I/O interface. User programs are able to treat the interface as a sequential device capable of reading and/or writing sequences of characters.
The kernel recognizes the hardware device and handles the I/O interface itself. In fact, there might not even be a device file for the device.
The most common example of the first approach, which does not
rely on any kernel device driver, is how the X Window System traditionally handles the graphic display. This is
quite efficient, although it constrains the X server from using the
hardware interrupts issued by the I/O device. This approach also
requires some additional effort to allow the X server to access the
required I/O ports. As mentioned in the section "Task State Segment" in
Chapter 3, the iopl( ) and ioperm( )
system calls grant a process the privilege to access
I/O ports. They can be invoked only by programs having root
privileges. But such programs can be made available to users by
setting the setuid flag of the executable file
(see the section "Process
Credentials and Capabilities" in Chapter 20).
Recent Linux versions support several widely used graphic cards. The /dev/fb device file provides an abstraction for the frame buffer of the graphic card and allows application software to access it without needing to know anything about the I/O ports of the graphics interface. Furthermore, the kernel supports the Direct Rendering Infrastructure (DRI) that allows application software to exploit the hardware of accelerated 3D graphics cards. In any case, the traditional do-it-yourself X Window System server is still widely adopted.
The minimal support approach is used to handle external hardware devices connected to a general-purpose I/O interface. The kernel takes care of the I/O interface by offering a device file (and thus a device driver); the application program handles the external hardware device by reading and writing the device file.
Minimal support is preferable to extended support because it keeps the kernel size small. However, among the general-purpose I/O interfaces commonly found on a PC, only the serial port and the parallel port can be handled with this approach. Thus, a serial mouse is directly controlled by an application program, such as the X server, and a serial modem always requires a communication program, such as Minicom, Seyon, or a Point-to-Point Protocol (PPP) daemon.
Minimal support has a limited range of applications, because it cannot be used when the external device must interact heavily with internal kernel data structures. For example, consider a removable hard disk that is connected to a general-purpose I/O interface. An application program cannot interact with all kernel data structures and functions needed to recognize the disk and to mount its filesystem, so extended support is mandatory in this case.
In general, every hardware device directly connected to the I/O bus, such as the internal hard disk, is handled according to the extended support approach: the kernel must provide a device driver for each such device. External devices attached to the Universal Serial Bus (USB), the PCMCIA port found in many laptops, or the SCSI interface—in short, every general-purpose I/O interface except the serial and the parallel ports—also require extended support.
It is worth noting that the standard file-related system calls
such as open( ) , read( ) , and write( )
do not always give the application full control of the
underlying hardware device. In fact, the lowest-common-denominator
approach of the VFS does not include room for special commands that
some devices need or let an application check whether the device is in
a specific internal state.
The ioctl( ) system call was introduced to satisfy such needs.
Besides the file descriptor of the device file and a second 32-bit
parameter specifying the request, the system call can accept an
arbitrary number of additional parameters. For example, specific
ioctl( ) requests exist to get the
CD-ROM sound volume or to eject the CD-ROM media. Application programs
may provide the user interface of a CD player using these kinds of
ioctl( ) requests.
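From user space, such a request reduces to an open( ) plus an ioctl( ). A minimal sketch, with two assumptions flagged: the request code 0x5309 is CDROMEJECT as defined in &lt;linux/cdrom.h&gt; (redefined here to keep the example self-contained), and /dev/cdrom is an assumed device path.

```c
/* User Mode sketch: eject a CD-ROM via an ioctl( ) request. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define CDROM_EJECT_CMD 0x5309  /* CDROMEJECT from <linux/cdrom.h> */

/* Returns 0 on success, -1 on failure (e.g., no such device file). */
int eject_cdrom(const char *path)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    int ret = ioctl(fd, CDROM_EJECT_CMD);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

A CD player application would call `eject_cdrom("/dev/cdrom")` in response to the user pressing an eject button.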
Handling a character device is relatively easy, because usually sophisticated buffering strategies are not needed and disk caches are not involved. Of course, character devices differ in their requirements: some of them must implement a sophisticated communication protocol to drive the hardware device, while others just have to read a few values from a couple of I/O ports of the hardware devices. For instance, the device driver of a multiport serial card device (a hardware device offering many serial ports) is much more complicated than the device driver of a bus mouse.
Block device drivers, on the other hand, are inherently more complex than character device drivers. In fact, applications are entitled to ask repeatedly to read or write the same block of data. Furthermore, accesses to these devices are usually very slow. These peculiarities have a profound impact on the structure of the disk drivers. As we'll see in the next chapters, however, the kernel provides sophisticated components—such as the page cache and the block I/O subsystem—to handle them. In the rest of this chapter we focus our attention on the character device drivers.
A character device driver is described by a cdev structure, whose fields are listed in
Table 13-8.
Table 13-8. The fields of the cdev structure
| Type | Field | Description |
|---|---|---|
| struct kobject | kobj | Embedded kobject |
| struct module * | owner | Pointer to the module implementing the driver, if any |
| struct file_operations * | ops | Pointer to the file operations table of the device driver |
| struct list_head | list | Head of the list of inodes relative to device files for this character device |
| dev_t | dev | Initial major and minor numbers assigned to the device driver |
| unsigned int | count | Size of the range of device numbers assigned to the device driver |
The list field is the head of a
doubly linked circular list collecting inodes of character device files
that refer to the same character device driver. There could be many
device files having the same device number, and all of them refer to the
same character device. Moreover, a device driver can be associated with
a range of device numbers, not just a single one; all device files whose
numbers fall in the range are handled by the same character device
driver. The size of the range is stored in the count field.
The cdev_alloc( ) function
allocates dynamically a cdev
descriptor and initializes the embedded kobject so that the descriptor
is automatically freed when the reference counter becomes zero.
The cdev_add( ) function
registers a cdev descriptor in the
device driver model. The function initializes the dev and count fields of the cdev descriptor, then invokes the kobj_map( ) function. This function, in turn,
sets up the device driver model's data structures that glue the interval
of device numbers to the device driver descriptor.
The device driver model defines a kobject mapping
domain for the character devices, which is represented by a
descriptor of type kobj_map and is
referenced by the cdev_map global
variable. The kobj_map descriptor
includes a hash table of 255 entries indexed by the major number of the
intervals. The hash table stores objects of type probe, one for each registered range of major
and minor numbers, whose fields are listed in Table 13-9.
Table 13-9. The fields of the probe object
| Type | Field | Description |
|---|---|---|
| struct probe * | next | Next element in hash collision list |
| dev_t | dev | Initial device number (major and minor) of the interval |
| unsigned long | range | Size of the interval |
| struct module * | owner | Pointer to the module that implements the device driver, if any |
| kobj_probe_t * | get | Method for probing the owner of the interval |
| int (*)(dev_t, void *) | lock | Method for increasing the reference counter of the owner of the interval |
| void * | data | Private data for the owner of the interval |
When the kobj_map( ) function
is invoked, the specified interval of device numbers is added to the
hash table. The data field of the
corresponding probe object points to
the cdev descriptor of the device
driver. The value of this field is passed to the get and lock methods when they are executed. In this
case, the get method is implemented
by a short function that returns the address of the kobject embedded in
the cdev descriptor; the lock method, instead, essentially increases
the reference counter in the embedded kobject.
The kobj_lookup( ) function
receives as input parameters a kobject mapping domain and a device
number; it searches the hash table and returns the address of the
kobject of the owner of the interval including the number, if it was
found. When applied to the mapping domain of the character devices, the
function returns the address of the kobject embedded in the cdev descriptor of the device driver that owns
the interval of device numbers.
To keep track of which character device numbers are
currently assigned, the kernel uses a hash table chrdevs, which contains intervals of device
numbers. Two intervals may share the same major number, but they
cannot overlap, thus their minor numbers should be all different. The
table includes 255 entries, and the hash function masks out the four
higher-order bits of the major number—therefore, major numbers less
than 255 are hashed in different entries. Each entry points to the
first element of a collision list ordered by increasing major and
minor numbers.
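For reference, in the 2.6 kernel's internal representation (&lt;linux/kdev_t.h&gt;, not shown in this section) a dev_t packs a 12-bit major number and a 20-bit minor number. A minimal User Mode re-implementation of the MKDEV, MAJOR, and MINOR macros illustrates the encoding; the lowercase names are ours, to avoid clashing with system headers:

```c
/* Userspace re-implementation of the 2.6 kernel's device number macros:
 * a 32-bit dev_t holds the major number in the 12 high-order bits and
 * the minor number in the 20 low-order bits. */

#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)

static unsigned int mk_dev(unsigned int major, unsigned int minor)
{
    return (major << MINORBITS) | minor;
}

static unsigned int dev_major(unsigned int dev) { return dev >> MINORBITS; }
static unsigned int dev_minor(unsigned int dev) { return dev & MINORMASK; }
```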
Each list element is a char_device_struct structure, whose fields
are shown in Table
13-10.
Table 13-10. The fields of the char_device_struct descriptor
| Type | Field | Description |
|---|---|---|
| struct char_device_struct * | next | Pointer to the next element in the hash collision list |
| unsigned int | major | The major number of the interval |
| unsigned int | baseminor | The initial minor number of the interval |
| int | minorct | The interval size |
| char[64] | name | The name of the device driver that handles the interval |
| struct file_operations * | fops | Not used |
| struct cdev * | cdev | Pointer to the character device driver descriptor |
There are essentially two methods for assigning a range of
device numbers to a character device driver. The first method, which
should be used for all new device drivers, relies on the register_chrdev_region( ) and alloc_chrdev_region( ) functions, and
assigns an arbitrary range of device numbers. For instance, to get an
interval of numbers starting from the dev_t value dev and of size size:
register_chrdev_region(dev, size, "foo");
These functions do not execute cdev_add( ), so the device driver must
execute cdev_add( ) after the
requested interval has been successfully assigned.
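Putting the two steps together, a new-style driver's registration path might look like the following kernel-module fragment (the foo_fops table, the device name, and the number of minors are illustrative assumptions):

```c
/* Hypothetical new-style registration: allocate an interval of device
 * numbers, then register the cdev descriptor for it. */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>

#define FOO_NR_DEVS 4                    /* assumed number of minors */

static dev_t foo_dev;
static struct cdev *foo_cdev;
extern struct file_operations foo_fops;  /* the driver's file operations */

static int foo_register(void)
{
    int err;

    /* dynamically allocate a major number and FOO_NR_DEVS minors */
    err = alloc_chrdev_region(&foo_dev, 0, FOO_NR_DEVS, "foo");
    if (err)
        return err;

    foo_cdev = cdev_alloc();
    if (!foo_cdev) {
        unregister_chrdev_region(foo_dev, FOO_NR_DEVS);
        return -ENOMEM;
    }
    foo_cdev->owner = THIS_MODULE;
    foo_cdev->ops = &foo_fops;

    /* glue the interval of device numbers to the driver descriptor */
    return cdev_add(foo_cdev, foo_dev, FOO_NR_DEVS);
}
```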
The second method makes use of the register_chrdev( ) function and assigns a
fixed interval of device numbers including a single major number and
minor numbers from 0 to 255. In this case, the device driver must not
invoke cdev_add( ).
The register_chrdev_region(
) function receives three parameters: the initial device
number (major and minor numbers), the size of the requested range of
device numbers (as the number of minor numbers), and the name of the
device driver that is requesting the device numbers. The function
checks whether the requested range spans several major numbers and,
if so, determines the major numbers and the corresponding intervals
that cover the whole range; then, the function invokes __register_chrdev_region( ) (described
below) on each of these intervals.
The alloc_chrdev_region( )
function is similar, but it is used to allocate dynamically a major
number; thus, it receives as its parameters the initial minor number
of the interval, the size of the interval, and the name of the
device driver. This function also ends up invoking __register_chrdev_region( ).
The __register_chrdev_region( ) function executes the following steps:
Allocates a new char_device_struct structure, and
fills it with zeros.
If the major number of the interval is zero, then the
device driver has requested the dynamic allocation of the major
number. Starting from the last hash table entry and proceeding
backward, the function looks for an empty collision list
(NULL pointer), which
corresponds to a yet unused major number. If no empty entry is
found, the function returns an error code.[*]
Initializes the fields of the char_device_struct structure with the
initial device number of the interval, the interval size, and
the name of the device driver.
Executes the hash function to compute the hash table index corresponding to the major number.
Walks the collision list, looking for the correct position
of the new char_device_struct
structure. Meanwhile, if an interval overlapping with the
requested one is found, it returns an error code.
Inserts the new char_device_struct descriptor in the
collision list.
Returns the address of the new char_device_struct descriptor.
The register_chrdev(
) function is used by drivers that require an old-style
interval of device numbers: a single major number and minor numbers
ranging from 0 to 255. The function receives as its parameters the
requested major number major
(zero for dynamic allocation), the name of the device driver
name, and a pointer fops to a table of file operations
specific to the character device files in the interval. It executes
the following operations:
Invokes the __register_chrdev_region( ) function to allocate the
requested interval. If the function returns an error code (the
interval cannot be assigned), it terminates.
Allocates a new cdev
structure for the device driver.
Initializes the cdev
structure:
Sets the type of the embedded kobject to the ktype_cdev_dynamic type descriptor
(see the earlier section "Kobjects").
Sets the owner
field with the contents of fops->owner.
Sets the ops field
with the address fops of
the table of file operations.
Copies the characters of the device driver name into
the name field of the
embedded kobject.
Invokes the cdev_add( )
function (explained previously).
Sets the cdev field of the char_device_struct descriptor returned by __register_chrdev_region( ) in step 1 with the address of the cdev descriptor of the device driver.
Returns the major number of the assigned interval.
We mentioned in the earlier section "VFS Handling of Device
Files" that the dentry_open(
) function triggered by the open(
) system call service routine customizes the f_op field in the file object of the
character device file so that it points to the def_chr_fops table. This table is almost
empty; it only defines the chrdev_open(
) function as the open
method of the device file. This method is immediately invoked by
dentry_open( ).
The chrdev_open( ) function
receives as its parameters the addresses inode and filp of the inode and file objects relative
to the device file being opened. It executes essentially the following
operations:
Checks the inode->i_cdev pointer to the device
driver's cdev descriptor. If
this field is not NULL, then
the inode has already been accessed: increases the reference
counter of the cdev descriptor
and jumps to step 6.
Invokes the kobj_lookup( ) function to search for the interval including the device number. If no such interval exists, it returns an error code; otherwise, it computes the address of the cdev descriptor associated with the interval.
Sets the inode->i_cdev
field of the inode object to the address of the cdev descriptor.
Sets the inode->i_cindex field to the relative
index of the device number inside the interval of the device
driver (index zero for the first minor number in the interval, one
for the second, and so on).
Adds the inode object into the list pointed to by the
list field of the cdev descriptor.
Initializes the filp->f_ops file operations pointer
with the contents of the ops
field of the cdev
descriptor.
If the filp->f_ops->open method is
defined, the function executes it. If the device driver handles
more than one device number, typically this function sets the file
operations of the file object once again, so as to install the
file operations suitable for the accessed device file.
Terminates by returning zero (success).
Traditionally, Unix-like operating systems divide hardware devices into block and character devices. However, this classification does not tell the whole story. Some devices are capable of transferring sizeable amounts of data in a single I/O operation, while others transfer only a few characters.
For instance, a PS/2 mouse driver gets a few bytes in each read operation corresponding to the status of the mouse button and to the position of the mouse pointer on the screen. This kind of device is the easiest to handle. Input data is first read one character at a time from the device input register and stored in a proper kernel data structure; the data is then copied at leisure into the process address space. Similarly, output data is first copied from the process address space to a proper kernel data structure and then written one at a time into the I/O device output register. Clearly, I/O drivers for such devices do not use the DMA, because the CPU time spent to set up a DMA I/O operation is comparable to the time spent to move the data to or from the I/O ports.
On the other hand, the kernel must also be ready to deal with devices that yield a large number of bytes in each I/O operation, either sequential devices such as sound cards or network cards, or random access devices such as disks of all kinds (floppy, CD-ROM, SCSI disk, etc.).
Suppose, for instance, that you have set up the sound card of your computer so that you are able to record sounds coming from a microphone. The sound card samples the electrical signal coming from the microphone at a fixed rate, say 44.14 kHz, and produces a stream of 16-bit numbers divided into blocks of input data. The sound card driver must be able to cope with this avalanche of data in all possible situations, even when the CPU is temporarily busy running some other process.
This can be done by combining two different techniques:
Use of DMA to transfer blocks of data.
Use of a circular buffer of two or more elements, each element having the size of a block of data. When an interrupt occurs signaling that a new block of data has been read, the interrupt handler advances a pointer to the elements of the circular buffer so that further data will be stored in an empty element. Conversely, whenever the driver succeeds in copying a block of data into user address space, it releases an element of the circular buffer so that it is available for saving new data from the hardware device.
The role of the circular buffer is to smooth out the peaks of CPU load; even if the User Mode application receiving the data is slowed down because of other higher-priority tasks, the DMA is able to continue filling elements of the circular buffer because the interrupt handler executes on behalf of the currently running process.
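The two-index scheme described above can be modeled in plain C. The sketch below is a deliberately simplified User Mode model, not actual driver code: it ignores locking and DMA, and the element count and block size are assumptions.

```c
/* Simplified circular buffer of NELEM fixed-size blocks: the "interrupt
 * handler" side advances the write index, the driver side advances the
 * read index. The free-running unsigned indices are used modulo NELEM. */
#include <string.h>

#define NELEM 8      /* number of elements; a power of two */
#define BLKSZ 512    /* assumed size of one block of data */

static char ring[NELEM][BLKSZ];
static unsigned int wr, rd;

/* Called when a transfer completes; returns 0 if the ring is full. */
int ring_put(const char *block)
{
    if (wr - rd == NELEM)
        return 0;               /* overrun: no empty element */
    memcpy(ring[wr % NELEM], block, BLKSZ);
    wr++;
    return 1;
}

/* Called by the driver to drain one block; returns 0 if the ring is empty. */
int ring_get(char *block)
{
    if (wr == rd)
        return 0;
    memcpy(block, ring[rd % NELEM], BLKSZ);
    rd++;
    return 1;
}
```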
A similar situation occurs when receiving packets from a network card, except that in this case, the flow of incoming data is asynchronous. Packets are received independently from each other and the time interval that occurs between two consecutive packet arrivals is unpredictable.
All considered, buffer handling for sequential devices is easy because the same buffer is never reused: an audio application cannot ask the microphone to retransmit the same block of data.
We'll see in Chapter 15 that buffering for random access devices (all kinds of disks) is much more complicated.
[*] Notice that the kernel can dynamically allocate only major numbers less than 255, and that in some cases allocation can fail even if there is a unused major number less than 255. We might expect that these constraints will be removed in the future.
This chapter deals with I/O drivers for block devices, i.e., for disks of every kind. The key aspect of a block device is the disparity between the time taken by the CPU and buses to read or write data and the speed of the disk hardware. Block devices have very high average access times. Each operation requires several milliseconds to complete, mainly because the disk controller must move the heads on the disk surface to reach the exact position where the data is recorded. However, when the heads are correctly placed, data transfer can be sustained at rates of tens of megabytes per second.
The organization of Linux block device handlers is quite involved. We won't be able to discuss in detail all the functions that are included in the block I/O subsystem of the kernel; however, we'll outline the general software architecture. As in the previous chapter, our objective is to explain how Linux supports the implementation of block device drivers , rather than showing how to implement one of them.
We start in the first section "Block Devices Handling" to explain the general architecture of the Linux block I/O subsystem. In the sections "The Generic Block Layer," "The I/O Scheduler," and "Block Device Drivers," we will describe the main components of the block I/O subsystem. Finally, in the last section, "Opening a Block Device File," we will outline the steps performed by the kernel when opening a block device file.
Each operation on a block device driver involves a large number of kernel components; the most important ones are shown in Figure 14-1.
Let us suppose, for instance, that a process issued a read( ) system call on some disk file—we'll see that write
requests are handled essentially in the same way. Here is what the
kernel typically does to service the process request:
The service routine of the read(
) system call activates a suitable VFS function, passing
to it a file descriptor and an offset inside the file. The Virtual
Filesystem
is the upper layer of the block device handling architecture, and it provides a common file model adopted by all filesystems supported by Linux. We have described at length the VFS layer in Chapter 12.
The VFS function determines if the requested data is already available and, if necessary, how to perform the read operation. Sometimes there is no need to access the data on disk, because the kernel keeps in RAM the data most recently read from—or written to—a block device. The disk cache mechanism is explained in Chapter 15, while details on how the VFS handles the disk operations and how it interfaces with the disk cache and the filesystems are given in Chapter 16.
我们假设内核必须从块设备读取数据,因此它必须确定该数据的物理位置。为此,内核依赖于映射层 ,通常执行两个步骤:
它确定包含该文件的文件系统的块大小,并根据文件块号计算所请求数据的范围。本质上,文件被视为分割成许多块,内核确定包含所请求数据的块的编号(相对于文件开头的索引)。
接下来,映射层调用特定于文件系统的函数,该函数访问文件的磁盘索引节点并根据 逻辑块号确定所请求的数据在磁盘上的位置。本质上,磁盘被视为分成块,内核确定与存储请求数据的块相对应的编号(相对于磁盘或分区的开头的索引)。由于文件可能存储在磁盘上不相邻的块中,因此存储在磁盘索引节点中的数据结构将每个文件块号映射到逻辑块号。[ * ]
我们将在第 16 章中看到映射层的实际应用,而我们将在第 18 章中介绍一些典型的基于磁盘的文件系统。
Let's assume that the kernel must read the data from the block device, thus it must determine the physical location of that data. To do this, the kernel relies on the mapping layer , which typically executes two steps:
It determines the block size of the filesystem including the file and computes the extent of the requested data in terms of file block numbers . Essentially, the file is seen as split in many blocks, and the kernel determines the numbers (indices relative to the beginning of file) of the blocks containing the requested data.
Next, the mapping layer invokes a filesystem-specific function that accesses the file's disk inode and determines the position of the requested data on disk in terms of logical block numbers. Essentially, the disk is seen as split in blocks, and the kernel determines the numbers (indices relative to the beginning of the disk or partition) corresponding to the blocks storing the requested data. Because a file may be stored in nonadjacent blocks on disk, a data structure stored in the disk inode maps each file block number to a logical block number.[*]
We will see the mapping layer in action in Chapter 16, while we will present some typical disk-based filesystems in Chapter 18.
内核现在可以在块设备上发出读取操作。它利用通用块层 ,它启动传输请求数据的 I/O 操作。一般来说,每个 I/O 操作都涉及磁盘上相邻的一组块。由于请求的数据不一定在磁盘上相邻,因此通用块层可能会启动多个 I/O 操作。每个I/O操作都由一个“块I/O”(简称“bio”)结构来表示,它收集下层组件满足请求所需的所有信息。
通用块层隐藏了每个硬件块设备的特性,从而提供了块设备的抽象视图。由于几乎所有块设备都是磁盘,因此通用块层还提供了一些描述“磁盘”和“磁盘分区”的通用数据结构。我们将在本章后面的“通用块层”部分讨论通用块层和 bio 结构。
The kernel can now issue the read operation on the block device. It makes use of the generic block layer , which starts the I/O operations that transfer the requested data. In general, each I/O operation involves a group of blocks that are adjacent on disk. Because the requested data is not necessarily adjacent on disk, the generic block layer might start several I/O operations. Each I/O operation is represented by a "block I/O" (in short, "bio") structure, which collects all information needed by the lower components to satisfy the request.
The generic block layer hides the peculiarities of each hardware block device, thus offering an abstract view of the block devices. Because almost all block devices are disks, the generic block layer also provides some general data structures that describe "disks" and "disk partitions." We will discuss the generic block layer and the bio structure in the section "The Generic Block Layer" later in this chapter.
在通用块层下面,“I/O 调度程序”根据预定义的内核策略对待处理的 I/O 数据传输请求进行排序。调度程序的目的是将物理介质上彼此靠近的数据请求分组。我们将在本章后面的“ I/O 调度程序”部分中描述该组件。
Below the generic block layer, the "I/O scheduler " sorts the pending I/O data transfer requests according to predefined kernel policies. The purpose of the scheduler is to group requests of data that lie near each other on the physical medium. We will describe this component in the section "The I/O Scheduler" later in this chapter.
最后,块设备驱动程序通过向磁盘控制器的硬件接口发送适当的命令来处理实际的数据传输。我们将在本章后面的“块设备驱动程序”部分中解释通用块设备驱动程序的整体组织。
Finally, the block device drivers take care of the actual data transfer by sending suitable commands to the hardware interfaces of the disk controllers. We will explain the overall organization of a generic block device driver in the section "Block Device Drivers" later in this chapter.
正如您所看到的,有许多内核组件与存储在块设备中的数据有关;它们每个都使用不同长度的块来管理磁盘数据:
As you can see, there are many kernel components that are concerned with data stored in block devices; each of them manages the disk data using chunks of different length:
硬件块设备的控制器以固定长度的块(称为“扇区”)传输数据。因此,I/O 调度程序和块设备驱动程序必须管理数据扇区。
The controllers of the hardware block devices transfer data in chunks of fixed length called "sectors." Therefore, the I/O scheduler and the block device drivers must manage sectors of data.
虚拟文件系统、映射层和文件系统将磁盘数据分组为称为“块”的逻辑单元。块对应于文件系统内的最小磁盘存储单元。
The Virtual Filesystem, the mapping layer, and the filesystems group the disk data in logical units called "blocks." A block corresponds to the minimal disk storage unit inside a filesystem.
正如我们稍后将看到的,块设备驱动程序应该能够处理数据“段”:每个段都是一个内存页(或内存页的一部分),包括磁盘上物理上相邻的数据块。
As we will see shortly, block device drivers should be able to cope with "segments" of data: each segment is a memory page—or a portion of a memory page—including chunks of data that are physically adjacent on disk.
The disk caches work on "pages" of disk data, each of which fits in a page frame.
The generic block layer glues together all the upper and lower components, thus it knows about sectors , blocks, segments, and pages of data.
即使有许多不同的数据块,它们通常也共享相同的物理 RAM 单元。例如,图 14-2显示了 4,096 字节页面的布局。上层内核组件将页面视为由四个块缓冲区组成,每个块缓冲区为 1,024 字节。该页的最后三个块正在由块设备驱动程序传输,因此它们被插入到覆盖该页的最后 3,072 字节的段中。硬盘控制器认为该段由六个 512 字节扇区组成。
Even if there are many different chunks of data, they usually share the same physical RAM cells. For instance, Figure 14-2 shows the layout of a 4,096-byte page. The upper kernel components see the page as composed of four block buffers of 1,024 bytes each. The last three blocks of the page are being transferred by the block device driver, thus they are inserted in a segment covering the last 3,072 bytes of the page. The hard disk controller considers the segment as composed of six 512-byte sectors.
在本章中,我们描述处理块设备的较低内核组件——通用块层、I/O 调度程序和块设备驱动程序——因此我们将注意力集中在扇区、块和段上。
In this chapter we describe the lower kernel components that handle the block devices—generic block layer, I/O scheduler, and block device drivers—thus we focus our attention on sectors, blocks, and segments.
为了获得可接受的性能,硬盘和类似设备一次传输多个相邻字节。块设备的每个数据传输操作都作用于一组称为扇区的相邻字节。在下面的讨论中,当字节组以单次寻道操作即可访问的方式记录在磁盘表面上时,我们就说它们是相邻的。尽管磁盘的物理几何结构通常非常复杂,但硬盘控制器接受把磁盘视为一个大型扇区阵列的命令。
To achieve acceptable performance, hard disks and similar devices transfer several adjacent bytes at once. Each data transfer operation for a block device acts on a group of adjacent bytes called a sector. In the following discussion, we say that groups of bytes are adjacent when they are recorded on the disk surface in such a manner that a single seek operation can access them. Although the physical geometry of a disk is usually very complicated, the hard disk controller accepts commands that refer to the disk as a large array of sectors.
在大多数磁盘设备中,扇区的大小为 512 字节,尽管有些设备使用更大的扇区(1,024 和 2,048 字节)。注意,扇区应该被视为数据传输的基本单位;尽管大多数磁盘设备能够同时传输几个相邻的扇区,但传输的扇区永远不可能少于一个。
In most disk devices, the size of a sector is 512 bytes, although there are devices that use larger sectors (1,024 and 2,048 bytes). Notice that the sector should be considered as the basic unit of data transfer; it is never possible to transfer less than one sector, although most disk devices are capable of transferring several adjacent sectors at once.
在 Linux 中,扇区的大小通常设置为 512 字节;如果块设备使用更大的扇区,相应的低级块设备驱动程序将执行必要的转换。因此,存储在块设备中的一组数据在磁盘上通过其位置(第一个 512 字节扇区的索引)及其长度(512 字节扇区的数量)来标识。扇区索引存储在 sector_t 类型的 32 位或 64 位变量中。
In Linux, the size of a sector is conventionally set to 512 bytes; if a block device uses larger sectors, the corresponding low-level block device driver will do the necessary conversions. Thus, a group of data stored in a block device is identified on disk by its position—the index of the first 512-byte sector—and its length as number of 512-byte sectors. Sector indices are stored in 32- or 64-bit variables of type sector_t.
扇区是硬件设备数据传输的基本单位,而块是VFS 乃至文件系统数据传输的基本单位。例如,当内核访问一个文件的内容时,它必须首先从磁盘读取一个包含该文件的磁盘inode的块(参见第12章的“Inode对象”一节)。磁盘上的这一块对应于一个或多个相邻扇区,VFS 将这些扇区视为单个数据单元。
While the sector is the basic unit of data transfer for the hardware devices, the block is the basic unit of data transfer for the VFS and, consequently, for the filesystems. For example, when the kernel accesses the contents of a file, it must first read from disk a block containing the disk inode of the file (see the section "Inode Objects" in Chapter 12). This block on disk corresponds to one or more adjacent sectors, which are looked at by the VFS as a single data unit.
在Linux中,块大小必须是2的幂并且不能大于页框。此外,它必须是扇区大小的倍数,因为每个块必须包含整数个扇区。因此,在 80 × 86 架构上,允许的块大小为 512、1,024、2,048 和 4,096 字节。
In Linux, the block size must be a power of 2 and cannot be larger than a page frame. Moreover, it must be a multiple of the sector size, because each block must include an integral number of sectors. Therefore, on 80 × 86 architecture, the permitted block sizes are 512, 1,024, 2,048, and 4,096 bytes.
块大小并不特定于某个块设备。创建基于磁盘的文件系统时,管理员可以选择适当的块大小。因此,同一磁盘上的多个分区可能使用不同的块大小。此外,对块设备文件发出的每个读或写操作都是绕过基于磁盘的文件系统的“原始”访问;内核通过使用最大尺寸(4,096 字节)的块来执行它。
The block size is not specific to a block device. When creating a disk-based filesystem, the administrator may select the proper block size. Thus, several partitions on the same disk might make use of different block sizes. Furthermore, each read or write operation issued on a block device file is a "raw" access that bypasses the disk-based filesystem; the kernel executes it by using blocks of largest size (4,096 bytes).
每个块都需要自己的块缓冲区,这是内核用来存储块内容的 RAM 内存区域。当内核从磁盘读取一个块时,它会用从硬件设备获取的值填充相应的块缓冲区;类似地,当内核在磁盘上写入块时,它会使用关联块缓冲区的实际值更新硬件设备上相应的一组相邻字节。块缓冲区的大小始终与相应块的大小匹配。
Each block requires its own block buffer, which is a RAM memory area used by the kernel to store the block's content. When the kernel reads a block from disk, it fills the corresponding block buffer with the values obtained from the hardware device; similarly, when the kernel writes a block on disk, it updates the corresponding group of adjacent bytes on the hardware device with the actual values of the associated block buffer. The size of a block buffer always matches the size of the corresponding block.
每个缓冲区都有一个“缓冲区头”描述符,其类型为 buffer_head。该描述符包含内核处理缓冲区所需的所有信息;因此,在对每个缓冲区进行操作之前,内核都会检查其缓冲区头。我们将在第 15 章对缓冲区头的所有字段进行详细解释;不过,在本章中,我们只考虑其中几个字段:b_page、b_data、b_blocknr 和 b_bdev。
Each buffer has a "buffer head" descriptor of type buffer_head. This descriptor contains all the information needed by the kernel to know how to handle the buffer; thus, before operating on each buffer, the kernel checks its buffer head. We will give a detailed explanation of all fields of the buffer head in Chapter 15; in the present chapter, however, we will only consider a few fields: b_page, b_data, b_blocknr, and b_bdev.
b_page 字段存储包含块缓冲区的页框的页描述符地址。如果页框位于高端内存,则 b_data 字段存储块缓冲区在页内的偏移量;否则,它存储块缓冲区本身的起始线性地址。b_blocknr 字段存储逻辑块号(即块在磁盘分区内的索引)。最后,b_bdev 字段标识正在使用该缓冲区头的块设备(请参阅本章后面的“块设备”部分)。
The b_page field stores the page descriptor address of the page frame that includes the block buffer. If the page frame is in high memory, the b_data field stores the offset of the block buffer inside the page; otherwise, it stores the starting linear address of the block buffer itself. The b_blocknr field stores the logical block number (i.e., the index of the block inside the disk partition). Finally, the b_bdev field identifies the block device that is using the buffer head (see the section "Block Devices" later in this chapter).
我们知道每个磁盘 I/O 操作都包括将某些相邻扇区的内容从某些 RAM 位置传输到某些 RAM 位置。在几乎所有情况下,数据传输都是由磁盘控制器通过 DMA 操作直接执行的(参见第 13 章中的“直接内存访问(DMA) ”部分)。块设备驱动程序通过向磁盘控制器发送适当的命令来简单地触发数据传输;一旦数据传输完成,控制器就会发出中断来通知块设备驱动程序。
We know that each disk I/O operation consists of transferring the contents of some adjacent sectors from—or to—some RAM locations. In almost all cases, the data transfer is directly performed by the disk controller with a DMA operation (see the section "Direct Memory Access (DMA)" in Chapter 13). The block device driver simply triggers the data transfer by sending suitable commands to the disk controller; once the data transfer is finished, the controller raises an interrupt to notify the block device driver.
单个 DMA 操作传输的数据必须属于磁盘上相邻的扇区。这是一个物理限制:允许 DMA 传输到非相邻扇区的磁盘控制器的传输速率会很差,因为在磁盘表面上移动读/写头是一个相当慢的操作。
The data transferred by a single DMA operation must belong to sectors that are adjacent on disk. This is a physical constraint: a disk controller that allows DMA transfers to non-adjacent sectors would have a poor transfer rate, because moving a read/write head on the disk surface is quite a slow operation.
较旧的磁盘控制器仅支持“简单”DMA 操作:在每个此类操作中,数据从 RAM 中物理上连续的内存单元传输或传输到内存单元。然而,最近的磁盘控制器也可能支持所谓的分散-集中 DMA 传输 :在每个此类操作中,数据可以从多个不连续的存储区域传输或传输到多个不连续的存储区域。
Older disk controllers support "simple" DMA operations only: in each such operation, data is transferred from or to memory cells that are physically contiguous in RAM. Recent disk controllers, however, may also support the so-called scatter-gather DMA transfers : in each such operation, the data can be transferred from or to several noncontiguous memory areas.
对于每个分散-聚集 DMA 传输,块设备驱动程序必须向磁盘控制器发送:
For each scatter-gather DMA transfer, the block device driver must send to the disk controller:
初始磁盘扇区号和要传输的扇区总数
The initial disk sector number and the total number of sectors to be transferred
内存区域描述符列表,每个描述符由一个地址和一个长度组成。
A list of descriptors of memory areas, each of which consists of an address and a length.
磁盘控制器负责整个数据传输;例如,在读取操作中,控制器从相邻磁盘扇区获取数据并将其分散到各个存储区域中。
The disk controller takes care of the whole data transfer; for instance, in a read operation the controller fetches the data from the adjacent disk sectors and scatters it into the various memory areas.
为了利用分散-聚集 DMA 操作,块设备驱动程序必须以称为段的单位处理 数据 。段只是一个内存页或内存页的一部分,其中包括一些相邻磁盘扇区的数据。因此,分散-聚集 DMA 操作可能同时涉及多个段。
To make use of scatter-gather DMA operations, block device drivers must handle the data in units called segments . A segment is simply a memory page—or a portion of a memory page—that includes the data of some adjacent disk sectors. Thus, a scatter-gather DMA operation may involve several segments at once.
请注意,块设备驱动程序不需要了解块、块大小和块缓冲区。因此,即使一个段被更高层视为由几个块缓冲区组成的页,块设备驱动程序也不关心它。
Notice that a block device driver does not need to know about blocks, block sizes, and block buffers. Thus, even if a segment is seen by the higher levels as a page composed of several block buffers, the block device driver does not care about it.
正如我们将看到的,如果相应的页框恰好在 RAM 中是连续的并且相应的磁盘数据块在磁盘上是相邻的,则通用块层可以合并不同的段。此合并操作产生的较大内存区域称为 物理段。
As we'll see, the generic block layer can merge different segments if the corresponding page frames happen to be contiguous in RAM and the corresponding chunks of disk data are adjacent on disk. The larger memory area resulting from this merge operation is called physical segment.
在通过专用总线电路(IO-MMU;参见第 13 章中的“直接内存访问(DMA) ”部分)处理总线地址和物理地址之间的映射的体系结构上允许进行另一种合并操作。这种合并操作产生的内存区域称为硬件段 。因为我们将重点关注 80 × 86 架构,该架构在总线地址和物理地址之间没有这种动态映射,所以在本章的其余部分中我们将假设硬件段始终与物理段一致。
Yet another merge operation is allowed on architectures that handle the mapping between bus addresses and physical addresses through a dedicated bus circuitry (the IO-MMU; see the section "Direct Memory Access (DMA)" in Chapter 13). The memory area resulting from this kind of merge operation is called hardware segment . Because we will focus on the 80 × 86 architecture, which has no such dynamic mapping between bus addresses and physical addresses, we will assume in the rest of this chapter that hardware segments always coincide with physical segments .
[ * ]但是,如果读取访问是在原始块设备文件上完成的,则映射层不会调用特定于文件系统的方法;相反,它将块设备文件中的偏移量转换为磁盘(或磁盘分区)内与设备文件相对应的位置。
[*] However, if the read access was done on a raw block device file, the mapping layer does not invoke a filesystem-specific method; rather, it translates the offset in the block device file to a position inside the disk—or disk partition—corresponding to the device file.
通用块层是一个内核组件,负责处理系统中所有块设备的请求。由于其功能,内核可以轻松地:
The generic block layer is a kernel component that handles the requests for all block devices in the system. Thanks to its functions, the kernel may easily:
将数据缓冲区放在高端内存中——只有当 CPU 必须访问数据时,页帧才会被映射到内核线性地址空间,并在之后立即取消映射。
Put data buffers in high memory—the page frame(s) will be mapped in the kernel linear address space only when the CPU must access the data, and will be unmapped right after.
通过一些额外的努力,实现“零复制”模式,其中磁盘数据直接放入用户模式地址空间,而不首先复制到内核内存;本质上,内核用于 I/O 传输的缓冲区位于映射到进程的用户模式线性地址空间中的页帧中。
Implement—with some additional effort—a "zero-copy" schema, where disk data is directly put in the User Mode address space without being copied to kernel memory first; essentially, the buffer used by the kernel for the I/O transfer lies in a page frame mapped in the User Mode linear address space of a process.
管理逻辑卷,例如 LVM(逻辑卷管理器)和 RAID(廉价磁盘冗余阵列)使用的逻辑卷:多个磁盘分区,即使位于不同的块设备上,也可以视为单个分区。
Manage logical volumes—such as those used by LVM (the Logical Volume Manager) and RAID (Redundant Array of Inexpensive Disks): several disk partitions, even on different block devices, can be seen as a single partition.
Exploit the advanced features of the most recent disk controllers, such as large onboard disk caches , enhanced DMA capabilities, onboard scheduling of the I/O transfer requests, and so on.
通用块层的核心数据结构是一个称为 bio 的描述符,它代表一个正在进行的块设备 I/O 操作。每个 bio 本质上包括一个磁盘存储区域的标识符(初始扇区号和该存储区域包含的扇区数),以及描述 I/O 操作所涉及内存区域的一个或多个段。bio 由 bio 数据结构实现,其字段如表 14-1 所示。
The core data structure of the generic block layer is a descriptor of an ongoing I/O block device operation called bio. Each bio essentially includes an identifier for a disk storage area—the initial sector number and the number of sectors included in the storage area—and one or more segments describing the memory areas involved in the I/O operation. A bio is implemented by the bio data structure, whose fields are listed in Table 14-1.
表 14-1。bio 结构的字段
Table 14-1. The fields of the bio structure
| 类型 Type | 字段 Field | 描述 Description |
|---|---|---|
| sector_t | bi_sector | 块 I/O 操作的磁盘上的第一个扇区 First sector on disk of block I/O operation |
| struct bio * | bi_next | 链接到请求队列中的下一个 bio Link to the next bio in the request queue |
| struct block_device * | bi_bdev | 指向块设备描述符的指针 Pointer to block device descriptor |
| unsigned long | bi_flags | bio 状态标志 Bio status flags |
| unsigned long | bi_rw | I/O 操作标志 I/O operation flags |
| unsigned short | bi_vcnt | bio 的 bio_vec 数组中的段数 Number of segments in the bio's bio_vec array |
| unsigned short | bi_idx | bio 的 bio_vec 段数组中的当前索引 Current index in the bio's bio_vec array of segments |
| unsigned short | bi_phys_segments | 合并后 bio 的物理段数 Number of physical segments of the bio after merging |
| unsigned short | bi_hw_segments | 合并后的硬件段数 Number of hardware segments after merging |
| unsigned int | bi_size | 待传输的字节数 Bytes (yet) to be transferred |
| unsigned int | bi_hw_front_size | 由硬件段合并算法使用 Used by the hardware segment merge algorithm |
| unsigned int | bi_hw_back_size | 由硬件段合并算法使用 Used by the hardware segment merge algorithm |
| unsigned int | bi_max_vecs | bio 的 bio_vec 数组中允许的最大段数 Maximum allowed number of segments in the bio's bio_vec array |
| struct bio_vec * | bi_io_vec | 指向 bio 的 bio_vec 段数组的指针 Pointer to the bio's bio_vec array of segments |
| bio_end_io_t * | bi_end_io | bio 的 I/O 操作结束时调用的方法 Method invoked at the end of bio's I/O operation |
| atomic_t | bi_cnt | bio 的引用计数器 Reference counter for the bio |
| void * | bi_private | 通用块层和块设备驱动程序的 I/O 完成方法使用的指针 Pointer used by the generic block layer and the I/O completion method of the block device driver |
| bio_destructor_t * | bi_destructor | 释放 bio 时调用的析构方法 Destructor method invoked when the bio is being freed |
bio 中的每个段都由一个 bio_vec 数据结构表示,其字段列于表 14-2 中。bio 的 bi_io_vec 字段指向 bio_vec 数据结构数组的第一个元素,而 bi_vcnt 字段存储数组中当前的元素个数。
Each segment in a bio is represented by a bio_vec data structure, whose fields are listed in Table 14-2. The bi_io_vec field of the bio points to the first element of an array of bio_vec data structures, while the bi_vcnt field stores the current number of elements in the array.
表 14-2。bio_vec结构体的字段
Table 14-2. The fields of the bio_vec structure
| 类型 Type | 字段 Field | 描述 Description |
|---|---|---|
| struct page * | bv_page | 指向段的页框的页描述符的指针 Pointer to the page descriptor of the segment's page frame |
| unsigned int | bv_len | 段的长度(以字节为单位) Length of the segment in bytes |
| unsigned int | bv_offset | 段数据在页框中的偏移量 Offset of the segment's data in the page frame |
bio 描述符的内容在块 I/O 操作期间不断变化。例如,如果块设备驱动程序无法通过一次分散-聚集 DMA 操作完成整个数据传输,则更新 bi_idx 字段以跟踪 bio 中尚未传输的第一个段。为了迭代 bio 的各个段(从索引 bi_idx 处的当前段开始),设备驱动程序可以执行 bio_for_each_segment 宏。
The contents of a bio descriptor keep changing during the block I/O operation. For instance, if the block device driver cannot perform the whole data transfer with one scatter-gather DMA operation, the bi_idx field is updated to keep track of the first segment in the bio that is yet to be transferred. To iterate over the segments of a bio—starting from the current segment at index bi_idx—a device driver can execute the bio_for_each_segment macro.
当通用块层开始新的 I/O 操作时,它通过调用 bio_alloc( ) 函数来分配一个新的 bio 结构。通常,bio 是通过 slab 分配器分配的,但内核也保留了一个小的 bio 内存池,以便在内存稀缺时使用(参见第 8 章的“内存池”一节)。内核还为 bio_vec 结构保留了一个内存池——毕竟,如果无法分配要包含在 bio 中的段描述符,那么分配 bio 就没有意义了。相应地,bio_put( ) 函数递减 bio 的引用计数器(bi_cnt),如果计数器变为零,则释放该 bio 结构以及相关的 bio_vec 结构。
When the generic block layer starts a new I/O operation, it allocates a new bio structure by invoking the bio_alloc( ) function. Usually, bios are allocated through the slab allocator, but the kernel also keeps a small memory pool of bios to be used when memory is scarce (see the section "Memory Pools" in Chapter 8). The kernel also keeps a memory pool for the bio_vec structures—after all, it would not make sense to allocate a bio without being able to allocate the segment descriptors to be included in the bio. Correspondingly, the bio_put( ) function decrements the reference counter (bi_cnt) of a bio and, if the counter becomes zero, it releases the bio structure and the related bio_vec structures.
磁盘是由通用块层处理的逻辑块设备。通常,磁盘对应于硬件块设备,例如硬盘、软盘或 CD-ROM 盘。但是,磁盘也可以是构建在多个物理磁盘分区上的虚拟设备,或者是位于 RAM 的某些专用页面中的存储区域。无论如何,得益于通用块层提供的服务,上层内核组件都能以同样的方式对所有磁盘进行操作。
A disk is a logical block device that is handled by the generic block layer. Usually a disk corresponds to a hardware block device such as a hard disk, a floppy disk, or a CD-ROM disk. However, a disk can be a virtual device built upon several physical disk partitions, or a storage area living in some dedicated pages of RAM. In any case, the upper kernel components operate on all disks in the same way thanks to the services offered by the generic block layer.
磁盘由 gendisk 对象表示,其字段如表 14-3 所示。
A disk is represented by the gendisk object, whose fields are shown in Table 14-3.
表 14-3。gendisk 对象的字段
Table 14-3. The fields of the gendisk object
| 类型 Type | 字段 Field | 描述 Description |
|---|---|---|
| int | major | 磁盘主设备号 Major number of the disk |
| int | first_minor | 与磁盘关联的第一个次设备号 First minor number associated with the disk |
| int | minors | 与磁盘关联的次设备号范围 Range of minor numbers associated with the disk |
| char [32] | disk_name | 磁盘的常规名称(通常是相应设备文件的规范名称) Conventional name of the disk (usually, the canonical name of the corresponding device file) |
| struct hd_struct ** | part | 磁盘的分区描述符数组 Array of partition descriptors for the disk |
| struct block_device_operations * | fops | 指向块设备方法表的指针 Pointer to a table of block device methods |
| struct request_queue * | queue | 指向磁盘请求队列的指针(参见本章后面的“请求队列描述符”) Pointer to the request queue of the disk (see "Request Queue Descriptors" later in this chapter) |
| void * | private_data | 块设备驱动程序的私有数据 Private data of the block device driver |
| sector_t | capacity | 磁盘存储区域的大小(以扇区数为单位) Size of the storage area of the disk (in number of sectors) |
| int | flags | 描述磁盘类型的标志(见下文) Flags describing the kind of disk (see below) |
| char [64] | devfs_name | (现已弃用的)devfs 特殊文件系统中的设备文件名 Device filename in the (nowadays deprecated) devfs special filesystem |
| int | number | 不再使用 No longer used |
| struct device * | driverfs_dev | 指向磁盘硬件设备的 device 对象的指针 Pointer to the device object of the disk's hardware device |
| struct kobject | kobj | 嵌入式 kobject(参见第 13 章中的“Kobjects”部分) Embedded kobject (see the section "Kobjects" in Chapter 13) |
| struct timer_rand_state * | random | 指向记录磁盘中断时间的数据结构的指针;由内核内置随机数生成器使用 Pointer to a data structure that records the timing of the disk's interrupts; used by the kernel built-in random number generator |
| int | policy | 如果磁盘是只读的(禁止写操作)则设置为 1,否则设置为 0 Set to 1 if the disk is read-only (write operations forbidden), 0 otherwise |
| atomic_t | sync_io | 写入磁盘的扇区计数器,仅用于 RAID Counter of sectors written to disk, used only for RAID |
| unsigned long | stamp | 用于确定磁盘队列使用统计信息的时间戳 Timestamp used to determine disk queue usage statistics |
| unsigned long | stamp_idle | 与上面相同 Same as above |
| int | in_flight | 正在进行的 I/O 操作数 Number of ongoing I/O operations |
| struct disk_stats * | dkstats | 有关每个 CPU 磁盘使用情况的统计信息 Statistics about per-CPU disk usage |
该flags字段存储有关磁盘的信息。最重要的标志是GENHD_FL_UP:如果设置了该标志,则磁盘已初始化并正在工作。另一个相关标志是GENHD_FL_REMOVABLE,如果磁盘是可移动载体(例如软盘或 CD-ROM),则设置该标志。
The flags field stores information about the disk. The most important flag is GENHD_FL_UP: if it is set, the disk is initialized and working. Another relevant flag is GENHD_FL_REMOVABLE, which is set if the disk is a removable support, such as a floppy disk or a CD-ROM.
gendisk 对象的 fops 字段指向一个 block_device_operations 表,该表存储了块设备关键操作的一些自定义方法(见表 14-4)。
The fops field of the gendisk object points to a block_device_operations table, which stores a few custom methods for crucial operations of the block device (see Table 14-4).
表 14-4。块设备的方法
Table 14-4. The methods of the block devices
硬盘通常分为逻辑 分区 。每个块设备文件可以代表整个磁盘或磁盘内的一个分区。例如,主 EIDE 磁盘可能由主设备号为 3、次设备号为 0 的设备文件/dev/hda表示;磁盘内的前两个分区可能由分别具有主设备号 3 和次设备号 1 和 2 的设备文件/dev/hda1和/dev/hda2表示。一般来说,磁盘内的分区由连续的次编号来表征。
Hard disks are commonly split into logical partitions . Each block device file may represent either a whole disk or a partition inside the disk. For instance, a master EIDE disk might be represented by a device file /dev/hda having major number 3 and minor number 0; the first two partitions inside the disk might be represented by device files /dev/hda1 and /dev/hda2 having major number 3 and minor numbers 1 and 2, respectively. In general, the partitions inside a disk are characterized by consecutive minor numbers.
如果磁盘被分割成多个分区,它们的布局保存在一个 hd_struct 结构数组中,该数组的地址存储在 gendisk 对象的 part 字段中。该数组通过分区在磁盘内的相对索引进行索引。hd_struct 描述符的字段列于表 14-5 中。
If a disk is split in partitions, their layout is kept in an array of hd_struct structures whose address is stored in the part field of the gendisk object. The array is indexed by the relative index of the partition inside the disk. The fields of the hd_struct descriptor are listed in Table 14-5.
表 14-5。hd_struct 描述符的字段
Table 14-5. The fields of the hd_struct descriptor
| 类型 Type | 字段 Field | 描述 Description |
|---|---|---|
| sector_t | start_sect | 磁盘内分区的起始扇区 Starting sector of the partition inside the disk |
| sector_t | nr_sects | 分区长度(扇区数) Length of the partition (number of sectors) |
| struct kobject | kobj | 嵌入式 kobject(参见第 13 章中的“Kobjects”部分) Embedded kobject (see the section "Kobjects" in Chapter 13) |
| unsigned int | reads | 对分区发出的读操作数 Number of read operations issued on the partition |
| unsigned int | read_sectors | 从分区读取的扇区数 Number of sectors read from the partition |
| unsigned int | writes | 对分区发出的写操作数 Number of write operations issued on the partition |
| unsigned int | write_sectors | 写入分区的扇区数 Number of sectors written into the partition |
| int | policy | 如果分区是只读的则设置为 1,否则设置为 0 Set to 1 if the partition is read-only, 0 otherwise |
| int | partno | 磁盘内分区的相对索引 The relative index of the partition inside the disk |
当内核在系统中发现新磁盘时(在引导阶段,或者当可移动介质插入驱动器时,或者在运行时连接外部磁盘时),它会调用 alloc_disk( ) 函数,该函数分配并初始化一个新的 gendisk 对象;如果新磁盘分为多个分区,还会分配一个合适的 hd_struct 描述符数组。然后,它调用 add_disk( ) 函数将新的 gendisk 描述符插入通用块层的数据结构中(请参阅本章后面的“设备驱动程序注册和初始化”部分)。
When the kernel discovers a new disk in the system (in the boot phase, or when a removable media is inserted in a drive, or when an external disk is attached at run-time), it invokes the alloc_disk( ) function, which allocates and initializes a new gendisk object and, if the new disk is split in several partitions, a suitable array of hd_struct descriptors. Then, it invokes the add_disk( ) function to insert the new gendisk descriptor into the data structures of the generic block layer (see the section "Device Driver Registration and Initialization" later in this chapter).
让我们描述向通用块层提交 I/O 操作请求时内核执行的常见步骤顺序。我们假设请求的数据块在磁盘上相邻,并且内核已经确定了它们的物理位置。
Let us describe the common sequence of steps executed by the kernel when submitting an I/O operation request to the generic block layer. We'll assume that the requested chunks of data are adjacent on disk and that the kernel has already determined their physical location.
第一步是执行 bio_alloc( ) 函数来分配一个新的 bio 描述符。然后,内核通过设置几个字段来初始化该 bio 描述符:
The first step consists in executing the bio_alloc( ) function to allocate a new bio descriptor. Then, the kernel initializes the bio descriptor by setting a few fields:
该bi_sector字段设置为数据的初始扇区号(如果块设备分为多个分区,则扇区号相对于分区的开头)。
The bi_sector field is set to the initial sector number of the data (if the block device is split in several partitions, the sector number is relative to the start of the partition).
该bi_size字段设置为覆盖数据的扇区数。
The bi_size field is set to the number of sectors covering the data.
该bi_bdev字段设置为块设备描述符的地址(请参阅本章后面的“块设备”部分)。
The bi_bdev field is set to the address of the block device descriptor (see the section "Block Devices" later in this chapter).
该 bi_io_vec 字段设置为一个 bio_vec 数据结构数组的起始地址,其中每个数据结构描述了 I/O 操作所涉及的一个段(内存缓冲区);此外,bi_vcnt 字段被设置为 bio 中的段总数。
The bi_io_vec field is set to the initial address of an array of bio_vec data structures, each of which describes a segment (memory buffer) involved in the I/O operation; moreover, the bi_vcnt field is set to the total number of segments in the bio.
该bi_rw字段设置有请求操作的标志。最重要的标志指定数据传输方向:READ(0) 或WRITE(1)。
The bi_rw field is set with the flags of the requested operation. The most important flag specifies the data transfer direction: READ (0) or WRITE (1).
该bi_end_io字段被设置为每当bio上的I/O操作完成时执行的完成过程的地址。
The bi_end_io field is set to the address of a completion procedure that is executed whenever the I/O operation on the bio is completed.
一旦bio描述符被正确初始化,内核就会调用该generic_make_request( )函数,该函数是通用块层的主要入口点。该函数主要执行以下步骤:
Once the bio descriptor has been properly initialized, the kernel invokes the generic_make_request( ) function, which is the main entry point of the generic block layer. The function essentially executes the following steps:
检查 bio->bi_sector 是否超过块设备的扇区数。如果超过,该函数将设置 bio->bi_flags 的 BIO_EOF 标志,打印一条内核错误消息,调用 bio_endio( ) 函数,然后终止。bio_endio( ) 更新 bio 描述符的 bi_size 和 bi_sector 字段,并调用 bio 的 bi_end_io 方法。后一个函数的实现本质上取决于触发 I/O 数据传输的内核组件;我们将在后续章节中看到 bi_end_io 方法的一些示例。
Checks that bio->bi_sector does not exceed the number of sectors of the block device. If it does, the function sets the BIO_EOF flag of bio->bi_flags, prints a kernel error message, invokes the bio_endio( ) function, and terminates. bio_endio( ) updates the bi_size and bi_sector fields of the bio descriptor, and it invokes the bi_end_io method of the bio. The implementation of the latter function essentially depends on the kernel component that has triggered the I/O data transfer; we will see some examples of bi_end_io methods in the following chapters.
获取与块设备关联的请求队列 q(参见本章后面的“请求队列描述符”一节);它的地址可以在块设备描述符的 bd_disk 字段中找到,而块设备描述符又由 bio->bi_bdev 字段指向。
Gets the request queue q associated with the block device (see the section "Request Queue Descriptors" later in this chapter); its address can be found in the bd_disk field of the block device descriptor, which in turn is pointed to by the bio->bi_bdev field.
调用 block_wait_queue_running( ) 检查当前使用的 I/O 调度程序是否正在被动态替换;在这种情况下,该函数会让进程睡眠,直到新的 I/O 调度程序启动(请参阅下一节“I/O 调度程序”)。
Invokes block_wait_queue_running( ) to check whether the I/O scheduler currently in use is being dynamically replaced; in this case, the function puts the process to sleep until the new I/O scheduler is started (see the next section "The I/O Scheduler").
调用 blk_partition_remap( ) 检查块设备是否指向一个磁盘分区(即 bio->bi_bdev 不等于 bio->bi_bdev->bd_contains;请参阅本章后面的“块设备”部分)。在这种情况下,该函数从 bio->bi_bdev 字段获取该分区的 hd_struct 描述符,以执行以下子步骤:
根据数据传输方向,更新 hd_struct 描述符的 read_sectors 和 reads 字段,或 write_sectors 和 writes 字段。
调整 bio->bi_sector 字段,将相对于分区开头的扇区号转换为相对于整个磁盘的扇区号。
将 bio->bi_bdev 字段设置为整个磁盘的块设备描述符(bio->bi_bdev->bd_contains)。
从现在开始,通用块层、I/O 调度程序和设备驱动程序将忘记磁盘分区,直接对整个磁盘进行操作。
Invokes blk_partition_remap( ) to check whether the block device refers to a disk partition (bio->bi_bdev not equal to bio->bi_bdev->bd_contains; see the section "Block Devices" later in this chapter). In this case, the function gets the hd_struct descriptor of the partition from the bio->bi_bdev field to perform the following substeps:
Updates the read_sectors and reads fields, or the write_sectors and writes fields, of the hd_struct descriptor, according to the direction of data transfer.
Adjusts the bio->bi_sector field so as to transform the sector number relative to the start of the partition to a sector number relative to the whole disk.
Sets the bio->bi_bdev field to the block device descriptor of the whole disk (bio->bi_bdev->bd_contains).
From now on, the generic block layer, the I/O scheduler, and the device driver forget about disk partitioning, and work directly with the whole disk.
调用 q->make_request_fn 方法,将 bio 请求插入请求队列 q 中。
Invokes the q->make_request_fn method to insert the bio request in the request queue q.
返回。
Returns.
我们将在本章后面的“向 I/O 调度程序发出请求”部分讨论 make_request_fn 方法的一个典型实现。
We will discuss a typical implementation of the make_request_fn method in the section "Issuing a Request to the I/O Scheduler" later in this chapter.
尽管块设备驱动程序能够一次传输单个扇区,但块 I/O 层不会对磁盘上要访问的每个扇区执行单独的 I/O 操作;这会导致磁盘性能较差,因为定位扇区在磁盘表面上的物理位置非常耗时。相反,只要有可能,内核就会尝试将多个扇区聚集在一起并将它们作为一个整体进行处理,从而减少磁头移动的平均次数。
Although block device drivers are able to transfer a single sector at a time, the block I/O layer does not perform an individual I/O operation for each sector to be accessed on disk; this would lead to poor disk performance, because locating the physical position of a sector on the disk surface is quite time-consuming. Instead, the kernel tries, whenever possible, to cluster several sectors and handle them as a whole, thus reducing the average number of head movements.
当内核组件希望读取或写入某些磁盘数据时,它实际上会创建一个块设备请求。该请求本质上描述了所请求的扇区以及要对其执行的操作类型(读或写)。然而,内核不会在创建后立即满足请求——I/O 操作只是被调度并会在稍后的时间执行。矛盾的是,这种人为延迟是提高块设备性能的关键机制。当请求新的块数据传输时,内核通过稍微放大仍在等待的先前请求来检查是否可以满足该请求(即,是否可以在不进行进一步寻道操作的情况下满足新请求)。由于磁盘往往是按顺序访问的,因此这种简单的机制非常有效。
When a kernel component wishes to read or write some disk data, it actually creates a block device request. That request essentially describes the requested sectors and the kind of operation to be performed on them (read or write). However, the kernel does not satisfy a request as soon as it is created—the I/O operation is just scheduled and will be performed at a later time. This artificial delay is paradoxically the crucial mechanism for boosting the performance of block devices. When a new block data transfer is requested, the kernel checks whether it can be satisfied by slightly enlarging a previous request that is still waiting (i.e., whether the new request can be satisfied without further seek operations). Because disks tend to be accessed sequentially, this simple mechanism is very effective.
推迟请求会使块设备处理变得复杂。例如,假设一个进程打开一个常规文件,因此文件系统驱动程序想要从磁盘读取相应的索引节点。块设备驱动程序将请求放入队列中,并且进程被挂起,直到存储 inode 的块被传输。但是,块设备驱动程序本身无法被阻止,因为尝试访问同一磁盘的任何其他进程也会被阻止。
Deferring requests complicates block device handling. For instance, suppose a process opens a regular file and, consequently, a filesystem driver wants to read the corresponding inode from disk. The block device driver puts the request on a queue, and the process is suspended until the block storing the inode is transferred. However, the block device driver itself cannot be blocked, because any other process trying to access the same disk would be blocked as well.
为了防止块设备驱动程序被挂起,每个 I/O 操作都是异步处理的。特别是,块设备驱动程序是中断驱动的(请参阅上一章中的 “监视 I/O 操作”部分):通用块层调用I/O 调度程序来创建新的块设备请求或放大已存在的块设备请求。 1 然后终止。稍后激活的块设备驱动程序调用 策略例程选择待处理的请求并通过向磁盘控制器发出适当的命令来满足它。当 I/O 操作终止时,磁盘控制器会引发中断,并且相应的处理程序会在必要时再次调用策略例程,以处理另一个挂起的请求。
To keep the block device driver from being suspended, each I/O operation is processed asynchronously. In particular, block device drivers are interrupt-driven (see the section "Monitoring I/O Operations" in the previous chapter): the generic block layer invokes the I/O scheduler to create a new block device request or to enlarge an already existing one and then terminates. The block device driver, which is activated at a later time, invokes the strategy routine to select a pending request and satisfy it by issuing suitable commands to the disk controller. When the I/O operation terminates, the disk controller raises an interrupt and the corresponding handler invokes the strategy routine again, if necessary, to process another pending request.
每个块设备驱动程序都维护自己的请求队列,其中包含设备的待处理请求列表。如果磁盘控制器正在处理多个磁盘,则每个物理块设备通常有一个请求队列。I/O调度在每个请求队列上单独进行,从而提高磁盘性能。
Each block device driver maintains its own request queue, which contains the list of pending requests for the device. If the disk controller is handling several disks, there is usually one request queue for each physical block device. I/O scheduling is performed separately on each request queue, thus increasing disk performance.
每个请求队列都通过一个大型
request_queue数据结构来表示,其字段列于表14-6中。
Each request queue is represented by means of a large
request_queue data structure whose
fields are listed in Table
14-6.
表 14-6。请求队列描述符的字段
Table 14-6. The fields of the request queue descriptor
| 类型 Type | 字段 Field | 描述 Description |
|---|---|---|
| struct list_head | queue_head | 待处理请求列表 List of pending requests |
| struct request * | last_merge | 指向队列中首先考虑进行可能合并的请求描述符的指针 Pointer to descriptor of the request in the queue to be considered first for possible merging |
| elevator_t * | elevator | 指向电梯对象的指针(参见后面章节“I/O调度算法”) Pointer to the elevator object (see the later section "I/O Scheduling Algorithms") |
| struct request_list | rq | 用于分配请求描述符的数据结构 Data structure used for allocation of request descriptors |
| request_fn_proc * | request_fn | 实现驱动程序策略例程入口点的方法 Method that implements the entry point of the strategy routine of the driver |
| merge_request_fn * | back_merge_fn | 检查是否可以将 bio 合并到队列中最后一个请求的方法 Method to check whether it is possible to merge a bio to the last request in the queue |
| merge_request_fn * | front_merge_fn | 检查是否可以将 bio 合并到队列中第一个请求的方法 Method to check whether it is possible to merge a bio to the first request in the queue |
| merge_requests_fn * | merge_requests_fn | 尝试合并队列中两个相邻请求的方法 Method to attempt merging two adjacent requests in the queue |
| make_request_fn * | make_request_fn | 当必须将新请求插入队列时调用的方法 Method invoked when a new request has to be inserted in the queue |
| prep_rq_fn * | prep_rq_fn | 构建要发送到硬件设备以处理此请求的命令的方法 Method to build the commands to be sent to the hardware device to process this request |
| unplug_fn * | unplug_fn | 拔出块设备的方法(参见本章后面的“激活块设备驱动程序”一节) Method to unplug the block device (see the section "Activating the Block Device Driver" later in the chapter) |
| merge_bvec_fn * | merge_bvec_fn | 返回添加新段时可以插入到现有 bio 中的字节数的方法(通常未定义) Method that returns the number of bytes that can be inserted into an existing bio when adding a new segment (usually undefined) |
| activity_fn * | activity_fn | 将请求添加到队列时调用的方法(通常未定义) Method invoked when a request is added to a queue (usually undefined) |
| issue_flush_fn * | issue_flush_fn | 刷新请求队列时调用的方法(通过连续处理所有请求来清空队列) Method invoked when a request queue is flushed (the queue is emptied by processing all requests in a row) |
| struct timer_list | unplug_timer | 用于执行设备插入的动态定时器(参见后面的“激活块设备驱动程序”一节) Dynamic timer used to perform device plugging (see the later section "Activating the Block Device Driver") |
| int | unplug_thresh | 如果队列中待处理的请求数量超过此值,则立即拔出设备(默认为 4) If the number of pending requests in the queue exceeds this value, the device is immediately unplugged (default is 4) |
| unsigned long | unplug_delay | 设备拔出之前的时间延迟(默认为 3 毫秒) Time delay before device unplugging (default is 3 milliseconds) |
| struct work_struct | unplug_work | 用于拔出设备的工作队列(参见后面章节“激活块设备驱动程序”) Work queue used to unplug the device (see the later section "Activating the Block Device Driver") |
| struct backing_dev_info | backing_dev_info | 请参阅此表后面的文字 See the text following this table |
| void * | queuedata | 指向块设备驱动程序私有数据的指针 Pointer to private data of the block device driver |
| void * | activity_data | activity_fn 方法使用的私有数据 Private data used by the activity_fn method |
| unsigned long | bounce_pfn | 必须使用缓冲区回弹的页框号上限(请参阅本章后面的“提交请求”一节) Page frame number above which buffer bouncing must be used (see the section "Submitting a Request" later in this chapter) |
| int | bounce_gfp | 回弹缓冲区的内存分配标志 Memory allocation flags for bounce buffers |
| unsigned long | queue_flags | 描述队列状态的标志集 Set of flags describing the queue status |
| spinlock_t * | queue_lock | 指向请求队列锁的指针 Pointer to request queue lock |
| struct kobject | kobj | 请求队列的内嵌 kobject Embedded kobject for the request queue |
| unsigned long | nr_requests | 队列中的最大请求数 Maximum number of requests in the queue |
| unsigned int | nr_congestion_on | 如果待处理请求的数量超过此阈值,则认为队列拥塞 Queue is considered congested if the number of pending requests rises above this threshold |
| unsigned int | nr_congestion_off | 如果待处理请求的数量低于此阈值,则认为队列不拥塞 Queue is considered not congested if the number of pending requests falls below this threshold |
| unsigned int | nr_batching | 即使队列已满,特殊“批处理”进程也可以提交的待处理请求的最大数量(通常为 32) Maximum number (usually 32) of pending requests that can be submitted even when the queue is full by a special "batcher" process |
| unsigned short | max_sectors | 单个请求可处理的最大扇区数(可调) Maximum number of sectors handled by a single request (tunable) |
| unsigned short | max_hw_sectors | 单个请求可处理的最大扇区数(硬件约束) Maximum number of sectors handled by a single request (hardware constraint) |
| unsigned short | max_phys_segments | 单个请求可处理的最大物理段数 Maximum number of physical segments handled by a single request |
| unsigned short | max_hw_segments | 单个请求可处理的最大硬件段数(分散-聚集 DMA 操作中不同内存区域的最大数量) Maximum number of hardware segments handled by a single request (the maximum number of distinct memory areas in a scatter-gather DMA operation) |
| unsigned short | hardsect_size | 扇区的大小(以字节为单位) Size in bytes of a sector |
| unsigned int | max_segment_size | 物理段的最大大小(以字节为单位) Maximum size of a physical segment (in bytes) |
| unsigned long | seg_boundary_mask | 用于段合并的内存边界掩码 Memory boundary mask for segment merging |
| unsigned int | dma_alignment | DMA 缓冲区初始地址和长度的对齐位图(默认为 511) Alignment bitmap for initial address and length of DMA buffers (default 511) |
| struct blk_queue_tag * | queue_tags | 空闲/忙标签的位图(用于带标签的请求) Bitmap of free/busy tags (used for tagged requests) |
| atomic_t | refcnt | 队列的引用计数器 Reference counter of the queue |
| unsigned int | in_flight | 队列中待处理的请求数 Number of pending requests in the queue |
| unsigned int | sg_timeout | 用户定义的命令超时(仅由 SCSI 通用设备使用) User-defined command time-out (used only by SCSI generic devices) |
| unsigned int | sg_reserved_size | 基本未使用 Essentially unused |
| struct list_head | drain_list | 在 I/O 调度程序被动态替换期间暂时延迟的请求链表的头部 Head of a list of requests temporarily delayed until the I/O scheduler is dynamically replaced |
本质上,请求队列是一个双向链表,其元素是请求描述符(即 request 数据结构;请参阅下一节)。请求队列描述符的 queue_head 字段存储链表的头部(第一个哑元素),而请求描述符 queuelist 字段中的指针将每个请求链接到链表中的前一个和后一个元素。队列链表中元素的顺序特定于每个块设备驱动程序;不过,I/O 调度程序提供了几种预定义的元素排序方式,这些方式将在后面的“I/O 调度程序”一节中讨论。
Essentially, a request queue is a doubly linked list whose
elements are request descriptors (that is, request
data structures; see the next section). The queue_head field of the request queue
descriptor stores the head (the first dummy element) of the list,
while the pointers in the queuelist
field of the request descriptor link each request to the previous and
next elements in the list. The ordering of the elements in the queue
list is specific to each block device driver; the I/O scheduler
offers, however, several predefined ways of ordering elements, which
are discussed in the later section "The I/O Scheduler."
backing_dev_info 字段是一个 backing_dev_info 类型的小对象,它存储有关底层硬件块设备 I/O 数据流流量的信息。例如,它保存有关预读以及请求队列拥塞状态的信息。
The backing_dev_info field is
a small object of type backing_dev_info, which stores information
about the I/O data flow traffic for the underlying hardware block
device. For instance, it holds information about read-ahead and about
request queue congestion state.
对块设备的每个待处理请求都由一个请求描述符表示,该描述符存储在表 14-7 所示的 request 数据结构中。
Each pending request for a block device is represented
by a request descriptor, which is stored in the
request data structure illustrated
in Table
14-7.
表 14-7。请求描述符的字段
Table 14-7. The fields of the request descriptor
| 类型 Type | 字段 Field | 描述 Description |
|---|---|---|
| struct list_head | queuelist | 请求队列链表的指针 Pointers for request queue list |
| unsigned long | flags | 请求的标志(见下文) Flags of the request (see below) |
| sector_t | sector | 下一个要传输的扇区号 Number of the next sector to be transferred |
| unsigned long | nr_sectors | 整个请求中尚未传输的扇区数 Number of sectors yet to be transferred in the whole request |
| unsigned int | current_nr_sectors | 当前 bio 的当前段中尚未传输的扇区数 Number of sectors in the current segment of the current bio yet to be transferred |
| sector_t | hard_sector | 下一个要传输的扇区号 Number of the next sector to be transferred |
| unsigned long | hard_nr_sectors | 整个请求中尚未传输的扇区数(由通用块层更新) Number of sectors yet to be transferred in the whole request (updated by the generic block layer) |
| unsigned int | hard_cur_sectors | 当前 bio 的当前段中尚未传输的扇区数(由通用块层更新) Number of sectors in the current segment of the current bio yet to be transferred (updated by the generic block layer) |
| struct bio * | bio | 请求中尚未完全传输的第一个 bio First bio in the request that has not been completely transferred |
| struct bio * | biotail | 请求链表中的最后一个 bio Last bio in the request list |
| void * | elevator_private | 指向 I/O 调度程序私有数据的指针 Pointer to private data for the I/O scheduler |
| int | rq_status | 请求状态:本质上为 RQ_ACTIVE 或 RQ_INACTIVE Request status: essentially, either RQ_ACTIVE or RQ_INACTIVE |
| struct gendisk * | rq_disk | 请求引用的磁盘描述符 The descriptor of the disk referenced by the request |
| int | errors | 当前传输中发生的 I/O 错误数计数器 Counter for the number of I/O errors that occurred on the current transfer |
| unsigned long | start_time | 请求的开始时间(以 jiffies 为单位) Request's starting time (in jiffies) |
| unsigned short | nr_phys_segments | 请求的物理段数 Number of physical segments of the request |
| unsigned short | nr_hw_segments | 请求的硬件段数 Number of hardware segments of the request |
| int | tag | 与请求关联的标签(仅适用于支持多个未完成数据传输的硬件设备) Tag associated with the request (only for hardware devices supporting multiple outstanding data transfers) |
| char * | buffer | 指向当前数据传输的内存缓冲区的指针(若缓冲区在高端内存中则为 NULL) Pointer to the memory buffer of the current data transfer (NULL if the buffer is in high memory) |
| int | ref_count | 请求的引用计数器 Reference counter for the request |
| request_queue_t * | q | 指向包含请求的请求队列描述符的指针 Pointer to the descriptor of the request queue containing the request |
| struct request_list * | rl | 指向 request_list 数据结构的指针 Pointer to the request_list data structure |
| struct completion * | waiting | 用于等待数据传输结束的完成量(参见第 5 章“完成量”一节) Completion for waiting for the end of the data transfers (see the section "Completions" in Chapter 5) |
| void * | special | 指向当请求包含对硬件设备的“特殊”命令时所用数据的指针 Pointer to data used when the request includes a "special" command to the hardware device |
| unsigned int | cmd_len | cmd 字段中命令的长度 Length of the commands in the cmd field |
| unsigned char [] | cmd | 包含由请求队列的 prep_rq_fn 方法预先构建的命令的缓冲区 Buffer containing the pre-built commands prepared by the request queue's prep_rq_fn method |
| unsigned int | data_len | 通常为 data 字段所指缓冲区中数据的长度 Usually, the length of data in the buffer pointed to by the data field |
| void * | data | 设备驱动程序用来跟踪待传输数据的指针 Pointer used by the device driver to keep track of the data to be transferred |
| unsigned int | sense_len | sense 字段所指缓冲区的长度(若 sense 为 NULL 则为 0) Length of buffer pointed to by the sense field (0 if sense is NULL) |
| void * | sense | 指向用于输出感测命令的缓冲区的指针 Pointer to buffer used for output of sense commands |
| unsigned int | timeout | 请求的超时 Request's time-out |
| struct request_pm_state * | pm | 指向用于电源管理命令的数据结构的指针 Pointer to a data structure used for power-management commands |
每个请求由一个或多个 bio 结构组成。最初,通用块层创建仅包含一个 bio 的请求。稍后,I/O 调度程序可以通过向原始 bio 添加新段,或者将另一个 bio 结构链接到请求中来“扩展”该请求。当新数据与请求中已有的数据物理相邻时,这是可能的。请求描述符的 bio 字段指向请求中的第一个 bio 结构,而 biotail 字段指向最后一个 bio。rq_for_each_bio 宏实现一个循环,迭代请求中包含的所有 bio。
Each request consists of one or more bio structures. Initially,
the generic block layer creates a request including just one bio.
Later, the I/O scheduler may "extend" the request either by adding a
new segment to the original bio, or by linking another bio structure
into the request. This is possible when the new data is physically
adjacent to the data already in the request. The bio field of the request descriptor points
to the first bio structure in the request, while the biotail field points to the last bio. The
rq_for_each_bio macro implements a
loop that iterates over all bios included in a request.
请求描述符的几个字段可以动态改变。例如,一旦 bio 中引用的数据块全部传输完毕,bio 字段就会被更新,指向请求链表中的下一个 bio。与此同时,新的 bio 可以添加到请求链表的尾部,因此 biotail 字段也可能发生变化。
Several fields of the request descriptor may dynamically change.
For instance, as soon as the chunks of data referenced in a bio have
all been transferred, the bio field
is updated so that it points to the next bio in the request list.
Meanwhile, new bios can be added to the tail of the request list, so
the biotail field may also
change.
在传输磁盘扇区时,I/O 调度程序或设备驱动程序会修改请求描述符的其他几个字段。例如,该nr_sectors字段存储整个请求中尚未传输的扇区数,而该current_nr_sectors字段存储当前bio中尚未传输的扇区数。
Several other fields of the request descriptor are modified
either by the I/O scheduler or the device driver while the disk
sectors are being transferred. For instance, the nr_sectors field stores the number of
sectors yet to be transferred in the whole request, while the current_nr_sectors field stores the number
of sectors yet to be transferred in the current bio.
flags 字段存储大量标志,列于表 14-8 中。其中最重要的是 REQ_RW,它决定数据传输的方向。
The flags field stores a
large number of flags, which are listed in Table 14-8. The most
important one is, by far, REQ_RW,
which determines the direction of the data transfer.
表 14-8。请求描述符的标志
Table 14-8. The flags of the request descriptor
| 标志 Flag | 描述 Description |
|---|---|
| REQ_RW | 数据传输方向:读或写 Direction of data transfer: read or write |
| REQ_FAILFAST | 请求表示发生错误时不重试 I/O 操作 Request says to not retry the I/O operation in case of error |
| REQ_SOFTBARRIER | 请求充当 I/O 调度程序的屏障 Request acts as a barrier for the I/O scheduler |
| REQ_HARDBARRIER | 请求充当 I/O 调度程序和设备驱动程序的屏障——它应该在较早的请求之后、较新的请求之前处理 Request acts as a barrier for the I/O scheduler and the device driver—it should be processed after older requests and before newer ones |
| REQ_CMD | 请求包含正常的读或写 I/O 数据传输 Request includes a normal read or write I/O data transfer |
| REQ_NOMERGE | 请求不应被扩展或与其他请求合并 Request should not be extended or merged with other requests |
| REQ_STARTED | 请求正在处理中 Request is being processed |
| REQ_DONTPREP | 不要调用请求队列的 prep_rq_fn 方法 Do not invoke the prep_rq_fn method of the request queue |
| REQ_QUEUED | 请求被打上标签——也就是说,它引用的硬件设备可以同时管理许多未完成的数据传输 Request is tagged—that is, it refers to a hardware device that can manage many outstanding data transfers at the same time |
| REQ_PC | 请求包含要发送到硬件设备的直接命令 Request includes a direct command to be sent to the hardware device |
| REQ_BLOCK_PC | 与上一标志相同,但命令包含在 bio 中 Same as previous flag, but the command is included in a bio |
| REQ_SENSE | 请求包含“感测”请求命令(用于 SCSI 和 ATAPI 设备) Request includes a "sense" request command (for SCSI and ATAPI devices) |
| REQ_FAILED | 当请求中的感测或直接命令未按预期工作时设置 Set when a sense or direct command in the request did not work as expected |
| REQ_QUIET | 请求表示发生 I/O 错误时不生成内核消息 Request says to not generate kernel messages in case of I/O errors |
| REQ_SPECIAL | 请求包含针对硬件设备的特殊命令(例如驱动器复位) Request includes a special command for the hardware device (e.g., drive reset) |
| REQ_DRIVE_CMD | 请求包含针对 IDE 磁盘的特殊命令 Request includes a special command for IDE disks |
| REQ_DRIVE_TASK | 请求包含针对 IDE 磁盘的特殊命令 Request includes a special command for IDE disks |
| REQ_DRIVE_TASKFILE | 请求包含针对 IDE 磁盘的特殊命令 Request includes a special command for IDE disks |
| REQ_PREEMPT | 请求替换队列前端的当前请求(仅适用于 IDE 磁盘) Request replaces the current request in front of the queue (only for IDE disks) |
| REQ_PM_SUSPEND | 请求包含挂起硬件设备的电源管理命令 Request includes a power-management command to suspend the hardware device |
| REQ_PM_RESUME | 请求包含唤醒硬件设备的电源管理命令 Request includes a power-management command to awaken the hardware device |
| REQ_PM_SHUTDOWN | 请求包含关闭硬件设备的电源管理命令 Request includes a power-management command to switch off the hardware device |
| REQ_BAR_PREFLUSH | 请求包含要发送到磁盘控制器的“刷新队列”命令 Request includes a "flush queue" command to be sent to the disk controller |
| REQ_BAR_POSTFLUSH | 请求包含已发送到磁盘控制器的“刷新队列”命令 Request includes a "flush queue" command, which has been sent to the disk controller |
在非常重的负载和高磁盘活动的情况下,有限的空闲动态内存可能成为想要向请求队列 q 添加新请求的进程的瓶颈。为了应对这种情况,每个 request_queue 描述符都包含一个 request_list 数据结构,其中包括:
The limited amount of free dynamic memory may become,
under very heavy loads and high disk activity, a bottleneck for
processes that want to add a new request into a request queue
q. To cope with this kind of
situation, each request_queue
descriptor includes a request_list data structure, which
consists of:
指向请求描述符内存池的指针(参见第 8 章“内存池”一节)。
A pointer to a memory pool of request descriptors (see the section "Memory Pools" in Chapter 8).
两个计数器,分别记录为 READ 和 WRITE 请求分配的请求描述符数量。
Two counters for the number of requests descriptors
allocated for READ and
WRITE requests,
respectively.
两个标志,分别指示最近一次为 READ 或 WRITE 请求分配描述符是否失败。
Two flags indicating whether a recent allocation for a
READ or WRITE request, respectively,
failed.
两个等待队列,分别存储因等待可用的 READ 和 WRITE 请求描述符而休眠的进程。
Two wait queues storing the processes sleeping for
available READ and WRITE request descriptors,
respectively.
等待请求队列被刷新(清空)的进程的等待队列。
A wait queue for the processes waiting for a request queue to be flushed (emptied).
blk_get_request( ) 函数尝试从给定请求队列的内存池中获取一个空闲的请求描述符;如果内存不足并且内存池已耗尽,该函数要么使当前进程进入睡眠状态,要么(如果内核控制路径不能阻塞)返回 NULL。如果分配成功,该函数会把请求队列的 request_list 数据结构的地址存储在请求描述符的 rl 字段中。blk_put_request( ) 函数释放一个请求描述符;如果其引用计数器变为零,则该描述符被归还给从中获取它的内存池。
The blk_get_request( )
function tries to get a free request descriptor from the memory pool
of a given request queue; if memory is scarce and the memory pool is
exhausted, the function either puts the current process to sleep
or—if the kernel control path cannot block—returns NULL. If the allocation succeeds, the
function stores in the rl field
of the request descriptor the address of the request_list data structure of the request
queue. The blk_put_request( )
function releases a request descriptor; if its reference counter
becomes zero, the descriptor is given back to the memory pool from
which it was taken.
每个请求队列都有允许的待处理请求的最大数量。请求队列描述符的 nr_requests 字段存储每个数据传输方向允许的待处理请求的最大数量。默认情况下,一个队列最多有 128 个待处理的读请求和 128 个待处理的写请求。如果待处理的读(写)请求数量超过 nr_requests,则通过在请求队列描述符的 queue_flags 字段中设置 QUEUE_FLAG_READFULL(QUEUE_FLAG_WRITEFULL)标志将队列标记为已满,并且尝试为该数据传输方向添加请求的可阻塞进程会在 request_list 数据结构对应的等待队列中进入休眠。
Each request queue has a maximum number of allowed pending
requests. The nr_requests field
of the request queue descriptor stores the maximum number of allowed
pending requests for each data transfer direction. By default, a
queue has at most 128 pending read requests and 128 pending write
requests. If the number of pending read (write) requests exceeds
nr_requests, the queue is marked
as full by setting the QUEUE_FLAG_READFULL (QUEUE_FLAG_WRITEFULL) flag in the queue_flags field of the request queue
descriptor, and blockable processes trying to add requests for that
data transfer direction are put to sleep in the corresponding wait
queue of the request_list data
structure.
填满的请求队列会对系统性能产生负面影响,因为它会迫使许多进程在等待 I/O 数据传输完成时休眠。因此,如果给定方向的待处理请求数量超过请求队列描述符 nr_congestion_on 字段中存储的值(默认为 113),内核就认为该队列拥塞,并尝试减慢新请求的创建速率。当待处理请求的数量低于 nr_congestion_off 字段的值(默认为 111)时,拥塞的请求队列变为不拥塞。blk_congestion_wait( ) 函数使当前进程进入睡眠状态,直到任一请求队列不再拥塞或超时。
A filled-up request queue impacts negatively on the system's
performance, because it forces many processes to sleep while waiting
for the completion of I/O data transfers. Thus, if the number of
pending requests for a given direction exceeds the value stored in
the nr_congestion_on field of the
request queue descriptor (by default, 113), the kernel regards the queue
as congested and tries to slow down the
creation rate of the new requests. A congested request queue becomes
uncongested when the number of pending requests falls below the
value of the nr_congestion_off
field (by default, 111). The blk_congestion_wait( ) function puts the
current process to sleep until any request queue becomes uncongested
or a time-out elapses.
正如我们之前所看到的,延迟块设备驱动程序的激活是有利的,以便增加对相邻块进行集群请求的机会。这种延迟是通过一种称为设备插拔的 技术来实现的。[ * ]只要插入块设备驱动程序,即使驱动程序队列中有需要处理的请求,设备驱动程序也不会激活。
As we saw earlier, it's expedient to delay activation of the block device driver in order to increase the chances of clustering requests for adjacent blocks. The delay is accomplished through a technique known as device plugging and unplugging.[*] As long as a block device driver is plugged, the device driver is not activated even if there are requests to be processed in the driver's queues.
blk_plug_device( ) 函数插入(plug)一个块设备,或者更准确地说,插入一个由某个块设备驱动程序服务的请求队列。本质上,该函数接收请求队列描述符的地址 q 作为参数。它设置 q->queue_flags 字段中的 QUEUE_FLAG_PLUGGED 位;然后,重新启动嵌入在 q->unplug_timer 字段中的动态定时器。
The blk_plug_device( )
function plugs a block device—or more precisely, a request queue
serviced by some block device driver. Essentially, the function
receives as an argument the address q of a request queue descriptor. It sets the
QUEUE_FLAG_PLUGGED bit in the
q->queue_flags field; then, it
restarts the dynamic timer embedded in the q->unplug_timer field.
blk_remove_plug( ) 函数拔出(unplug)请求队列 q:它清除 QUEUE_FLAG_PLUGGED 标志,并取消 q->unplug_timer 动态定时器的执行。当所有“可见”的可合并请求都已添加到队列中时,内核可以显式调用此函数。此外,如果队列中待处理请求的数量超过请求队列描述符 unplug_thres 字段中存储的值(默认为 4),I/O 调度程序也会拔出请求队列。
The blk_remove_plug( )
function unplugs a request queue q:
it clears the QUEUE_FLAG_PLUGGED
flag and cancels the execution of the q->unplug_timer dynamic timer. This
function can be explicitly invoked by the kernel when all mergeable
requests "in sight" have been added to the queue. Moreover, the I/O
scheduler unplugs a request queue if the number of pending requests in
the queue exceeds the value stored in the unplug_thres field of the request queue
descriptor (by default, 4).
如果设备保持插入状态的时间达到 q->unplug_delay(通常为 3 毫秒),则由 blk_plug_device( ) 激活的动态定时器到期,从而执行 blk_unplug_timeout( ) 函数。结果,为 kblockd_workqueue 工作队列服务的 kblockd 内核线程被唤醒(参见第 4 章“工作队列”一节)。该内核线程执行地址存储在 q->unplug_work 数据结构中的函数,即 blk_unplug_work( ) 函数。该函数又调用请求队列的 q->unplug_fn 方法,该方法通常由 generic_unplug_device( ) 函数实现。generic_unplug_device( ) 函数负责拔出块设备:首先,它检查队列是否仍然处于活动状态;然后,调用 blk_remove_plug( );最后,执行策略例程(request_fn 方法),开始处理队列中的下一个请求(参见本章后面的“设备驱动程序注册和初始化”一节)。
If a device remains plugged for a time interval of length
q->unplug_delay (usually 3
milliseconds), the dynamic timer activated by blk_plug_device( ) elapses, thus the
blk_unplug_timeout( ) function is
executed. As a consequence, the kblockd
kernel thread servicing the kblockd_workqueue work queue is awakened
(see the section "Work
Queues" in Chapter
4). This kernel thread executes the function whose address is
stored in the q->unplug_work
data structure—that is, the blk_unplug_work(
) function. In turn, this function invokes the q->unplug_fn method of the request queue,
which is usually implemented by the generic_unplug_device( ) function. The
generic_unplug_device( ) function
takes care of unplugging the block device: first, it checks whether
the queue is still active; then, it invokes blk_remove_plug( ); and finally, it executes
the strategy routine—request_fn
method—to start processing the next request in the queue (see the
section "Device Driver
Registration and Initialization" later in this chapter).
当新请求添加到请求队列时,通用块层调用 I/O 调度程序来确定新元素在队列中的确切位置。I/O 调度程序尝试使请求队列按扇区排序。如果要处理的请求是从列表中顺序取出的,则磁盘寻道量将显着减少,因为磁头以线性方式从内磁道移动到外磁道(反之亦然),而不是从一个磁道随机跳跃到另一个。这种启发式让人想起电梯在处理来自不同楼层的上行或下行请求时使用的算法。电梯单向运行;当朝一个方向到达最后预订的楼层时,电梯改变方向并开始向另一个方向移动。因此,I/O 调度程序也称为 电梯。
When a new request is added to a request queue, the generic block layer invokes the I/O scheduler to determine the exact position of the new element in the queue. The I/O scheduler tries to keep the request queue sorted sector by sector. If the requests to be processed are taken sequentially from the list, the amount of disk seeking is significantly reduced because the disk head moves in a linear way from the inner track to the outer one (or vice versa) instead of jumping randomly from one track to another. This heuristic is reminiscent of the algorithm used by elevators when dealing with requests coming from different floors to go up or down. The elevator moves in one direction; when the last booked floor is reached in one direction, the elevator changes direction and starts moving in the other direction. For this reason, I/O schedulers are also called elevators.
在重负载下,严格遵循扇区号顺序的 I/O 调度算法不会很好地工作。实际上,在这种情况下,数据传输的完成时间很大程度上取决于磁盘上数据的物理位置。因此,如果设备驱动程序正在处理靠近队列顶部(较低扇区号)的请求,并且具有较低扇区号的新请求不断添加到队列中,则队列尾部的请求很容易陷入饥饿。因此,I/O 调度算法相当复杂。
Under heavy load, an I/O scheduling algorithm that strictly follows the order of the sector numbers is not going to work well. In this case, indeed, the completion time of a data transfer strongly depends on the physical position of the data on the disk. Thus, if a device driver is processing requests near the top of the queue (lower sector numbers), and new requests with low sector numbers are continuously added to the queue, then the requests in the tail of the queue can easily starve. I/O scheduling algorithms are thus quite sophisticated.
目前,Linux 2.6 提供四种不同类型的 I/O 调度程序(或电梯),称为“Anticipatory”、“Deadline”、“CFQ(完全公平队列)”和“Noop(无操作)”。内核对大多数块设备使用的默认电梯是在引导时使用内核参数elevator= <name>指定的,其中<name>是以下值之一:
as、deadline、cfq和noop。如果没有给出启动时间参数,内核将使用“Anticipatory”I/O 调度程序。无论如何,设备驱动程序可以用另一个电梯替换默认的电梯;设备驱动程序还可以定义其自定义 I/O 调度算法,但很少这样做。
Currently, Linux 2.6 offers four different types of I/O
schedulers—or elevators—called "Anticipatory," "Deadline," "CFQ
(Complete Fairness Queueing)," and "Noop (No Operation)." The default
elevator used by the kernel for most block devices is specified at
boot time with the kernel parameter elevator= <name>,
where <name> is one of the following:
as, deadline, cfq, and noop. If no boot time argument is given, the
kernel uses the "Anticipatory" I/O scheduler. Anyway, a device driver
can replace the default elevator with another one; a device driver can
also define its custom I/O scheduling algorithm, but this is very
seldom done.
此外,系统管理员可以在运行时更改特定块设备的 I/O 调度程序。例如,要更改第一个IDE通道的主盘中使用的I/O调度程序,管理员可以将电梯名称写入sysfs的 /sys/block/hda/queue/scheduler文件中 特殊文件系统(请参阅第 13 章中的“ sysfs 文件系统” 部分)。
Furthermore, the system administrator can change at runtime the I/O scheduler for a specific block device. For instance, to change the I/O scheduler used in the master disk of the first IDE channel, the administrator can write an elevator name into the /sys/block/hda/queue/scheduler file of the sysfs special filesystem (see the section "The sysfs Filesystem" in Chapter 13).
请求队列中使用的 I/O 调度算法由一个 elevator_t 类型的电梯对象表示;它的地址存储在请求队列描述符的 elevator 字段中。电梯对象包括涵盖电梯所有可能操作的若干方法:将电梯链接到请求队列以及解除链接、向队列添加和合并请求、从队列中删除请求、从队列中获取下一个要处理的请求,等等。电梯对象还存储一个表的地址,该表包含处理请求队列所需的全部信息。此外,每个请求描述符都包含一个 elevator_private 字段,指向 I/O 调度程序用来处理该请求的附加数据结构。
The I/O scheduler algorithm used in a request queue is
represented by an elevator object of type
elevator_t; its address is stored
in the elevator field of the
request queue descriptor. The elevator object includes several methods
covering all possible operations of the elevator: linking and
unlinking the elevator to a request queue, adding and merging requests
to the queue, removing requests from the queue, getting the next
request to be processed from the queue, and so on. The elevator object
also stores the address of a table including all information required
to handle the request queue. Furthermore, each request descriptor
includes an elevator_private field
that points to an additional data structure used by the I/O scheduler
to handle the request.
现在让我们从最简单的一种到最复杂的一种简要描述四种 I/O 调度算法。请注意,设计 I/O 调度程序与设计 CPU 调度程序非常相似(请参阅 第 7 章):启发式方法和所采用的常量值是大量测试和基准测试的结果。
Let us now briefly describe the four I/O scheduling algorithms, from the simplest one to the most sophisticated one. Be warned that designing an I/O scheduler is much like designing a CPU scheduler (see Chapter 7): the heuristics and the values of the adopted constants are the result of an extensive amount of testing and benchmarking.
一般来说,所有算法都使用调度队列(dispatch queue),其中包含按设备驱动程序应处理请求的顺序排序的所有请求:设备驱动程序要服务的下一个请求始终是调度队列中的第一个元素。调度队列实际上就是以请求队列描述符的 queue_head 字段为根的请求队列。几乎所有算法还使用额外的队列对请求进行分类和排序。所有这些算法都允许设备驱动程序将 bio 添加到现有请求中,并在必要时合并两个“相邻”请求。
Generally speaking, all algorithms make use of a
dispatch queue, which includes all requests
sorted according to the order in which the requests should be
processed by the device driver—the next request to be serviced by the
device driver is always the first element in the dispatch queue. The
dispatch queue is actually the request queue rooted at the queue_head field of the request queue
descriptor. Almost all algorithms also make use of additional queues
to classify and sort requests. All of them allow the device driver to
add bios to existing requests and, if necessary, to merge two
"adjacent" requests.
这是最简单的I/O调度算法。没有有序队列:新请求总是添加到调度队列的前端或尾部,并且下一个要处理的请求始终是队列中的第一个请求。
This is the simplest I/O scheduling algorithm. There is no ordered queue: new requests are always added either at the front or at the tail of the dispatch queue, and the next request to be processed is always the first request in the queue.
“完全公平排队”电梯的主要目标是确保在触发 I/O 请求的所有进程之间公平分配磁盘 I/O 带宽。为了实现这一结果,电梯使用大量排序队列(默认情况下为 64 个)来存储来自不同进程的请求。每当一个请求被交给电梯时,内核就会调用一个散列函数,将当前进程的线程组标识符(通常对应于 PID,请参阅第 3 章“识别进程”一节)转换为其中一个队列的索引;然后,电梯将新请求插入该队列的尾部。因此,来自同一进程的请求总是被插入同一个队列。
The main goal of the "Complete Fairness Queueing" elevator is ensuring a fair allocation of the disk I/O bandwidth among all the processes that trigger the I/O requests. To achieve this result, the elevator makes use of a large number of sorted queues—by default, 64—that store the requests coming from the different processes. Whenever a request is handed to the elevator, the kernel invokes a hash function that converts the thread group identifier of the current process (usually it corresponds to the PID, see the section "Identifying a Process" in Chapter 3) into the index of a queue; then, the elevator inserts the new request at the tail of this queue. Therefore, requests coming from the same process are always inserted in the same queue.
为了重新填充调度队列,电梯本质上以循环方式扫描 I/O 输入队列,选择第一个非空队列,并将一批请求从该队列移动到调度队列的尾部。
To refill the dispatch queue, the elevator essentially scans the I/O input queues in a round-robin fashion, selects the first nonempty queue, and moves a batch of requests from that queue into the tail of the dispatch queue.
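The round-robin scan can be sketched like this (a model under simplifying assumptions: input queues are reduced to pending-request counters, and batch movement is left out):

```c
/* Model of the round-robin refill: starting from the queue after the one
 * served last, find the first nonempty input queue; a batch of requests
 * would then be moved from it to the tail of the dispatch queue. */
#include <assert.h>

#define NQUEUES 4

struct rr_state {
    int pending[NQUEUES]; /* requests waiting in each input queue */
    int last;             /* index of the queue served last */
};

/* Returns the index of the queue a batch is taken from, or -1 if all are empty. */
static int rr_pick_queue(struct rr_state *st)
{
    for (int i = 1; i <= NQUEUES; i++) {
        int q = (st->last + i) % NQUEUES;
        if (st->pending[q] > 0) {
            st->last = q;
            return q;
        }
    }
    return -1;
}
```

Because the scan restarts just past the last served queue, every nonempty queue is eventually visited, which is what gives each process its fair share.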
Besides the dispatch queue, the "Deadline" elevator makes use of four queues. Two of them—the sorted queues —include the read and write requests, respectively, ordered according to their initial sector numbers. The other two—the deadline queues —include the same read and write requests sorted according to their "deadlines." These queues are introduced to avoid request starvation , which occurs when the elevator policy ignores for a very long time a request because it prefers to handle other requests that are closer to the last served one. A request deadline is essentially an expire timer that starts ticking when the request is passed to the elevator. By default, the expire time of read requests is 500 milliseconds, while the expire time for write requests is 5 seconds—read requests are privileged over write requests because they usually block the processes that issued them. The deadline ensures that the scheduler looks at a request if it's been waiting a long time, even if it is low in the sort.
When the elevator must replenish the dispatch queue, it first determines the data direction of the next request. If there are both read and write requests to be dispatched, the elevator chooses the "read" direction, unless the "write" direction has been discarded too many times (to avoid write request starvation).
Next, the elevator checks the deadline queue relative to the chosen direction: if the deadline of the first request in the queue is elapsed, the elevator moves that request to the tail of the dispatch queue; it also moves a batch of requests taken from the sorted queue, starting from the request following the expired one. The length of this batch is longer if the requests happen to be physically adjacent on disks, shorter otherwise.
Finally, if no request is expired, the elevator dispatches a batch of requests starting with the request following the last one taken from the sorted queue. When the cursor reaches the tail of the sorted queue, the search starts again from the top ("one-way elevator").
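The two core decisions of the "Deadline" policy (direction choice and expiry checking) can be modeled in a few lines. This is a sketch under stated assumptions: times are plain millisecond counters, `MAX_WRITE_SKIPS` is an invented name for the write-starvation bound, and the 500 ms / 5 s figures mirror the defaults quoted above:

```c
#include <assert.h>
#include <stdbool.h>

#define READ_EXPIRE_MS    500   /* default read deadline */
#define WRITE_EXPIRE_MS  5000   /* default write deadline */
#define MAX_WRITE_SKIPS     2   /* how often writes may be passed over */

struct dl_state { int write_skips; };

/* Choose the data direction: prefer reads, unless writes were starved. */
static bool dl_choose_read(struct dl_state *st, bool have_read, bool have_write)
{
    if (have_read && (!have_write || st->write_skips < MAX_WRITE_SKIPS)) {
        if (have_write)
            st->write_skips++;  /* writes were passed over once more */
        return true;
    }
    st->write_skips = 0;
    return false;
}

/* A request is serviced out of sorted order once its deadline elapses. */
static bool dl_expired(long now_ms, long queued_ms, bool is_read)
{
    return now_ms - queued_ms >= (is_read ? READ_EXPIRE_MS : WRITE_EXPIRE_MS);
}
```

After `MAX_WRITE_SKIPS` consecutive read batches, the model forces a write batch, which is the anti-starvation behavior described above.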
The "Anticipatory" elevator is the most sophisticated I/O scheduler algorithm offered by Linux. Basically, it is an evolution of the "Deadline" elevator, from which it borrows the fundamental mechanism: there are two deadline queues and two sorted queues; the I/O scheduler keeps scanning the sorted queues, alternating between read and write requests, but giving preference to the read ones. The scanning is basically sequential, unless a request expires. The default expire time for read requests is 125 milliseconds, while the default expire time for write requests is 250 milliseconds. The elevator, however, follows some additional heuristics:
In some cases, the elevator might choose a request behind the current position in the sorted queue, thus forcing a backward seek of the disk head. This happens, typically, when the seek distance for the request behind is less than half the seek distance of the request after the current position in the sorted queue.
The elevator collects statistics about the patterns of I/O operations triggered by every process in the system. Right after dispatching a read request that comes from some process P, the elevator checks whether the next request in the sorted queue comes from the same process P. If so, the next request is dispatched immediately. Otherwise, the elevator looks at the collected statistics about process P: if it decides that process P will likely issue another read request soon, then it stalls for a short period of time (by default, roughly 7 milliseconds). Thus, the elevator might anticipate a read request coming from process P that is "close" on disk to the request just dispatched.
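The anticipation decision can be condensed into a small model (a sketch: the function name is invented, the per-process statistics are reduced to a single boolean, and the 7 ms figure mirrors the default quoted above):

```c
#include <assert.h>
#include <stdbool.h>

#define ANTIC_EXPIRE_MS 7   /* default stall before giving up anticipation */

/* After dispatching a read from process last_pid, decide how long to stall:
 * 0 ms if the next sorted request already comes from the same process, or
 * if the statistics suggest no further nearby read; otherwise stall briefly
 * in the hope that last_pid issues another close read. */
static int anticipation_delay_ms(int last_pid, int next_pid,
                                 bool p_likely_to_read_again)
{
    if (next_pid == last_pid)
        return 0;   /* dispatch the next request immediately */
    return p_likely_to_read_again ? ANTIC_EXPIRE_MS : 0;
}
```

The bet is that waiting a few milliseconds for a nearby read costs less than the long seek that servicing an unrelated request would trigger.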
As seen in the section "Submitting a Request" earlier in this chapter, the generic_make_request( ) function invokes the make_request_fn method of the request queue descriptor to transmit a request to the I/O scheduler. This method is usually implemented by the __make_request( ) function; it receives as its parameters a request_queue descriptor q and a bio descriptor bio, and it performs the following operations:
Invokes the blk_queue_bounce( ) function to set up a bounce buffer, if required (see later). If a bounce buffer was created, the __make_request( ) function operates on it rather than on the original bio.
Invokes the I/O scheduler function elv_queue_empty( ) to check whether
there are pending requests in the request queue—notice that the
dispatch queue might be empty, but other queues of the I/O
scheduler might contain pending requests. If there are no pending
requests, it invokes the blk_plug_device(
) function to plug the request queue (see the section
"Activating the Block
Device Driver" earlier in this chapter), and jumps to step
5.
Here the request queue includes pending requests. Invokes
the elv_merge( ) I/O scheduler
function to check whether the new bio can be merged inside an
existing request. The function may return three possible
values:
ELEVATOR_NO_MERGE:
the bio cannot be included in an already existing request: in
that case, the function jumps to step 5.
ELEVATOR_BACK_MERGE:
the bio might be added as the last bio of some request
req: in that case, the
function invokes the q->back_merge_fn method to check
whether the request can be extended. If not, the function
jumps to step 5. Otherwise it inserts the bio descriptor at
the tail of the req's list
and updates the req's
fields. Then, it tries to merge the request with a following
request (the new bio might fill a hole between the two
requests).
ELEVATOR_FRONT_MERGE:
the bio can be added as the first bio of some request req: in that case, the function
invokes the q->front_merge_fn method to check
whether the request can be extended. If not, it jumps to step
5. Otherwise, it inserts the bio descriptor at the head of the
req's list and updates the
req's fields. Then, the
function tries to merge the request with the preceding
request.
The bio has been merged inside an already existing request. Jumps to step 7 to terminate the function.
Here the bio must be inserted in a new request. Allocates a
new request descriptor. If there is no free memory, the function
suspends the current process, unless the BIO_RW_AHEAD flag in bio->bi_rw is set, which means that
the I/O operation is a read-ahead (see Chapter 16); in this case,
the function invokes bio_endio(
) and terminates: the data transfer will not be
executed. For a description of bio_endio(
), see step 1 of generic_make_request( ) in the earlier
section "Submitting a
Request."
Initializes the fields of the request descriptor. In particular:
Initializes the various fields that store the sector numbers, the current bio, and the current segment according to the contents of the bio descriptor.
Sets the REQ_CMD flag
in the flags field (this is
a normal read or write operation).
If the page frame of the first bio segment is in low
memory, it sets the buffer
field to the linear address of that buffer.
Sets the rq_disk
field with the bio->bi_bdev->bd_disk
address.
Inserts the bio in the request list.
Sets the start_time
field to the value of jiffies.
All done. Before terminating, however, it checks whether the
BIO_RW_SYNC flag in bio->bi_rw is set. If so, it invokes
generic_unplug_device( ) on the
request queue to unplug the driver (see the section "Activating the Block Device
Driver" earlier in this chapter).
Terminates.
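The merge decision in step 3 can be modeled on sector arithmetic alone. This is a sketch, not the kernel's elv_merge( ): the names `try_merge` and `struct rq` are invented, and a request is reduced to a start sector plus a length:

```c
#include <assert.h>

/* A back merge appends a bio that starts exactly where the request ends;
 * a front merge prepends a bio that ends exactly where the request starts. */
enum merge { NO_MERGE, BACK_MERGE, FRONT_MERGE };

struct rq { unsigned long start, nsect; }; /* start sector, length in sectors */

static enum merge try_merge(const struct rq *r,
                            unsigned long bio_start, unsigned long bio_nsect)
{
    if (bio_start == r->start + r->nsect)
        return BACK_MERGE;
    if (bio_start + bio_nsect == r->start)
        return FRONT_MERGE;
    return NO_MERGE;
}
```

In the real elevator the driver's back_merge_fn/front_merge_fn methods get the final word, since hardware limits (maximum segments, maximum sectors) may forbid an otherwise adjacent merge.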
If the request queue was not empty before invoking __make_request( ), either the request queue is already unplugged, or it will be unplugged soon—because each plugged request queue q with pending requests has a running q->unplug_timer dynamic timer. On the other hand, if the request queue was empty, the __make_request( ) function plugs it. Sooner (on exiting from __make_request( ), if the BIO_RW_SYNC bio flag is set) or later (in the worst case, when the unplug timer decays), the request queue will be unplugged. In any case, eventually the strategy routine of the block device driver will take care of the requests in the dispatch queue (see the section "Device Driver Registration and Initialization" earlier in this chapter).
The blk_queue_bounce( )
function looks at the flags in q->bounce_gfp and at the threshold in
q->bounce_pfn to determine
whether buffer bouncing might be required. This happens when some of the
buffers in the request are located in high memory and the hardware
device is not able to address them.
Older DMA for ISA buses only handled 24-bit physical
addresses. In this case, the buffer bouncing threshold is set to 16
MB, that is, to page frame number 4096. Block device drivers,
however, do not usually rely on buffer bouncing when dealing with
older devices; rather, they prefer to directly allocate the DMA
buffers in the ZONE_DMA memory
zone.
If the hardware device cannot cope with buffers in high
memory, the function checks whether some of the buffers in the bio
must really be bounced. In this case, it makes a copy of the bio
descriptor, thus creating a bounce bio; then,
for each segment's page frame having number equal to or greater than
q->bounce_pfn, it performs the
following steps:
Allocates a page frame in the ZONE_NORMAL or ZONE_DMA memory zone, according to the
allocation flags.
Updates the bv_page
field of the segment in the bounce bio so that it points to the
descriptor of the new page frame.
If bio->bi_rw specifies a write operation, it invokes kmap( ) to temporarily map the high memory page in the kernel address space, copies the high memory page onto the low memory page, and invokes kunmap( ) to release the mapping.
The blk_queue_bounce( )
function then sets the BIO_BOUNCED flag in the bounce bio,
initializes a specific bi_end_io
method for the bounce bio, and finally stores in the bi_private field of the bounce bio the
pointer to the original bio. When the I/O data transfer on the
bounce bio terminates, the function that implements the bi_end_io method copies the data to the
high memory buffer (only for a read operation) and releases the
bounce bio.
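The write-side copy into low memory can be modeled in userspace as follows (a sketch under stated assumptions: pages are shrunk to 16-byte buffers, the 4096 threshold mirrors the ISA figure quoted below, and `bounce_for_write` is an invented name, not a kernel function):

```c
#include <assert.h>
#include <string.h>

/* When a segment's page frame number is at or above the bounce threshold,
 * the data must first be copied into a low page the device can address. */
#define BOUNCE_PFN 4096UL   /* 16 MB / 4 KB pages */

struct seg { unsigned long pfn; char data[16]; };

/* For a write operation: copy each high segment into a low bounce page.
 * Returns the number of bounce copies performed. */
static int bounce_for_write(struct seg *segs, int n, struct seg *bounce)
{
    int copies = 0;
    for (int i = 0; i < n; i++) {
        if (segs[i].pfn >= BOUNCE_PFN) {
            bounce[copies].pfn = 0; /* stands in for a ZONE_DMA page */
            memcpy(bounce[copies].data, segs[i].data, sizeof segs[i].data);
            copies++;
        }
    }
    return copies;
}
```

For a read, the copy runs in the opposite direction and is deferred to the bi_end_io method, once the device has filled the bounce page.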
Block device drivers are the lowest component of the Linux block subsystem. They get requests from the I/O scheduler, and do whatever is required to process them.
Block device drivers are, of course, integrated within the device
driver model described in the section "The Device Driver Model" in
Chapter 13. Therefore, each
of them refers to a device_driver
descriptor; moreover, each disk handled by the driver is associated with
a device descriptor. These
descriptors, however, are rather generic: the block I/O subsystem must
store additional information for each block device in the system.
A block device driver may handle several block devices. For instance, the IDE device driver can handle several IDE disks, each of which is a separate block device. Furthermore, each disk is usually partitioned, and each partition can be seen as a logical block device. Clearly, the block device driver must take care of all VFS system calls issued on the block device files associated with the corresponding block devices.
Each block device is represented by a block_device descriptor, whose fields are
listed in Table
14-9.
Table 14-9. The fields of the block device descriptor
All block device descriptors are inserted in a global list,
whose head is represented by the all_bdevs variable; the pointers for list
linkage are in the bd_list field of
the block device descriptor.
If the block device descriptor refers to a disk partition, the
bd_contains field points to the
descriptor of the block device associated with the whole disk, while
the bd_part field points to the
hd_struct partition descriptor (see
the section "Representing
Disks and Disk Partitions" earlier in this chapter). Otherwise,
if the block device descriptor refers to a whole disk, the bd_contains field points to the block device
descriptor itself, and the bd_part_count field records how many times
the partitions on the disk have been opened.
The bd_holder field stores a
linear address representing the holder of the
block device. The holder is not the block device driver that services
the I/O data transfers of the device; rather, it is a kernel component
that makes use of the device and has exclusive, special privileges
(for instance, it can freely use the bd_private field of the block device
descriptor). Typically, the holder of a block device is the filesystem
mounted over it. Another common case occurs when a block device file
is opened for exclusive access: the holder is the corresponding file
object.
The bd_claim( ) function sets
the bd_holder field with a
specified address; conversely, the bd_release( ) function resets the field to
NULL. Be aware, however, that the
same kernel component can invoke bd_claim(
) many times; each invocation increases the bd_holders field. To release the block
device, the kernel component must invoke bd_release( ) a corresponding number of
times.
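The claim/release pairing can be sketched as a reference count on the holder (a simplified model: the real bd_claim( ) can also fail when a different holder already owns the device, which is omitted here, and the structure is pared down to the two fields discussed):

```c
#include <assert.h>
#include <stddef.h>

/* bd_holders counts the claims made by the holder; the holder field is
 * cleared only when every claim has been matched by a release. */
struct bdev { void *bd_holder; int bd_holders; };

static void claim(struct bdev *b, void *holder)
{
    b->bd_holder = holder;
    b->bd_holders++;
}

static void release(struct bdev *b)
{
    if (--b->bd_holders == 0)
        b->bd_holder = NULL;
}
```

A kernel component that claimed the device twice must therefore release it twice before the block device becomes available to another holder.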
Figure 14-3 refers to a whole disk and illustrates how the block device descriptors are linked to the other main data structures of the block I/O subsystem.
Figure 14-3. Linking the block device descriptors with the other structures of the block subsystem
When the kernel receives a request for opening a block device file, it must first determine whether the device file is already open. In fact, if the file is already open, the kernel must not create and initialize a new block device descriptor; rather, it should update the already existing block device descriptor. To complicate life, block device files that have the same major and minor numbers but different pathnames are viewed by the VFS as different files, although they really refer to the same block device. Therefore, the kernel cannot determine whether a block device is already in use by simply checking for the existence in the inode cache of an inode object for a block device file.
The relationship between a major and minor number and the
corresponding block device descriptor is maintained through the
bdev special filesystem (see the section "Special Filesystems"
in Chapter 12). Each
block device descriptor is coupled with a bdev
special file: the bd_inode field
of the block device descriptor points to the corresponding
bdev inode; conversely, such an inode encodes
both the major and minor numbers of the block device and the address
of the corresponding descriptor.
The bdget( ) function receives as its parameters the major and minor numbers of a block device: it looks up in the bdev filesystem the associated inode; if such an inode does not exist, the function allocates a new inode and a new block device descriptor. In any case, the function returns the address of the block device descriptor corresponding to the given major and minor numbers.
Once the block device descriptor for a block device has been
found, the kernel can determine whether the block device is
currently in use by checking the value of the bd_openers field: if it is positive, the
block device is already in use (possibly by means of a different
device file). The kernel also maintains a list of inode objects
relative to opened block device files. This list is rooted at the
bd_inodes field of the block
device descriptor; the i_devices
field of the inode object stores the pointers for the previous and
next element in this list.
Let's now explain the essential steps involved in setting up a new device driver for a block device. Clearly, the description that follows is very succinct; nevertheless, it could be useful for understanding how and when the main data structures used by the block I/O subsystem are initialized.
We silently omit many steps required for all kinds of device drivers and already mentioned in Chapter 13. For example, we skip all steps required for registering the driver itself (see the section "The Device Driver Model" in Chapter 13). Usually, the block device belongs to a standard bus architecture such as PCI or SCSI, and the kernel offers helper functions that, as a side effect, register the driver in the device driver model.
First of all, the device driver needs a custom descriptor foo of type foo_dev_t holding the data required to drive the hardware device. For every device, the descriptor will store information such as the I/O ports used to program the device, the IRQ line of the interrupts raised by the device, the internal status of the device, and so on. The descriptor must also include a few fields required by the block I/O subsystem:
struct foo_dev_t {
    [...]
    spinlock_t lock;
    struct gendisk *gd;
    [...]
} foo;
The lock field is a spin
lock used to protect the fields of the
foo descriptor; its address is often
passed to kernel helper functions, which can thus protect the data
structures of the block I/O subsystem specific to the driver. The
gd field is a pointer to the
gendisk descriptor that
represents the whole block device (disk) handled by this
driver.
Reserving the major number
The device driver must reserve a major number for its own
purposes. Traditionally, this is done by invoking the register_blkdev( ) function:
err = register_blkdev(FOO_MAJOR, "foo");
if (err)
    goto error_major_is_busy;
This function is very similar to register_chrdev( ) presented in the section "Assigning Device Numbers" in Chapter 13: it reserves the major number FOO_MAJOR and associates the name foo to it. Notice that there is no way to allocate a subrange of minor numbers, because there is no analog of register_chrdev_region( ); moreover, no link is established between the reserved major number and the data structures of the driver. The only visible effect of register_blkdev( ) is to include a new item in the list of registered major numbers in the /proc/devices special file.
All the fields of the foo
descriptor must be initialized properly before making use of the
driver. To initialize the fields related to the block I/O subsystem,
the device driver could execute the following instructions:
spin_lock_init(&foo.lock);
foo.gd = alloc_disk(16);
if (!foo.gd)
    goto error_no_gendisk;
The driver initializes the spin lock, then allocates the disk descriptor. As shown earlier in Figure 14-3, the gendisk structure is crucial for the block I/O subsystem, because it references many other data structures. The alloc_disk( ) function also allocates the array that stores the partition descriptors of the disk. The argument of the function is the number of hd_struct elements in the array; the value 16 means that the driver can support disks containing up to 15 partitions (partition 0 is not used).
Next, the driver initializes some fields of the
gendisk descriptor:
foo.gd->private_data = &foo;
foo.gd->major = FOO_MAJOR;
foo.gd->first_minor = 0;
foo.gd->minors = 16;
set_capacity(foo.gd, foo_disk_capacity_in_sectors);
strcpy(foo.gd->disk_name, "foo");
foo.gd->fops = &foo_ops;
The address of the foo descriptor is saved in the private_data field of the gendisk structure, so that low-level driver functions invoked as methods by the block I/O subsystem can quickly find the driver descriptor—this improves efficiency if the driver can handle more than one disk at a time. The set_capacity( ) function initializes the capacity field with the size of the disk in 512-byte sectors—this value is likely determined by probing the hardware and asking about the disk parameters.
The fops field of the gendisk descriptor is initialized with the address of a custom table of block device methods (see Table 14-4 earlier in this chapter).[*] Quite likely, the foo_ops table of the device driver includes functions specific to the device driver. As an example, if the hardware device supports removable disks, the generic block layer may invoke the media_changed method to check whether the disk has been changed since the last mount or open operation on the block device. This check is usually done by sending some low-level commands to the hardware controller, thus the implementation of the media_changed method is always specific to the device driver.
Similarly, the ioctl method
is only invoked when the generic block layer does not know how to
handle some ioctl command. For
instance, the method is typically invoked when an ioctl( ) system call asks about the disk
geometry , that is, the number of cylinders, tracks, sectors,
and heads used by the disk. Thus, the implementation of this method
is specific to the device driver.
Our brave device driver designer might now set up a request queue that will collect the requests waiting to be serviced. This can be easily done as follows:
foo.gd->rq = blk_init_queue(foo_strategy, &foo.lock);
if (!foo.gd->rq)
    goto error_no_request_queue;
blk_queue_hardsect_size(foo.gd->rq, foo_hard_sector_size);
blk_queue_max_sectors(foo.gd->rq, foo_max_sectors);
blk_queue_max_hw_segments(foo.gd->rq, foo_max_hw_segments);
blk_queue_max_phys_segments(foo.gd->rq, foo_max_phys_segments);
The blk_init_queue( ) function allocates a request queue descriptor and initializes many of its fields with default values. It receives as its parameters the address of the device descriptor's spin lock—for the foo.gd->rq->queue_lock field—and the address of the strategy routine of the device driver—for the foo.gd->rq->request_fn field; see the next section, "The Strategy Routine." The blk_init_queue( ) function also initializes the foo.gd->rq->elevator field, forcing the driver to use the default I/O scheduler algorithm. If the device driver wants to use a different elevator, it may later override the address in the elevator field.
Next, some helper functions set various fields of the request queue descriptor with the proper values for the device driver (look at Table 14-6 for the similarly named fields).
As described in the section "I/O Interrupt Handling" in Chapter 4, the driver needs to register the IRQ line for the device. This can be done as follows:
request_irq(foo_irq, foo_interrupt, SA_INTERRUPT|SA_SHIRQ, "foo", NULL);
The foo_interrupt( ) function is the interrupt handler for the device; we discuss some of its peculiarities in the section "The Interrupt Handler" later in this chapter.
Finally, all the device driver's data structures are ready: the last step of the initialization phase consists of "registering" and activating the disk. This can be achieved simply by executing:
add_disk(foo.gd);
The add_disk( ) function
receives as its parameter the address of the gendisk descriptor, and essentially
executes the following operations:
Sets the GENHD_FL_UP
flag of gd->flags.
Invokes kobj_map() to
establish the link between the device driver and the device's
major number with its associated range of minor numbers (see the
section "Character
Device Drivers" in Chapter 13; be warned that
in this case the kobject mapping domain is represented by the bdev_map variable).
Registers the kobject included in the gendisk descriptor in the device
driver model as a new device serviced by the device driver
(e.g., /sys/block/foo).
Scans the partition table included in the disk, if any; for each partition found, properly initializes the corresponding hd_struct descriptor in the foo.gd->part array. Also registers the partitions in the device driver model (e.g., /sys/block/foo/foo1).
Registers the kobject embedded in the request queue descriptor in the device driver model (e.g., /sys/block/foo/queue).
add_disk( )返回后,设备驱动程序即开始积极工作。执行初始化阶段的函数终止;策略例程和中断处理程序负责处理由 I/O 调度程序传递给设备驱动程序的每个请求。
Once add_disk( ) returns,
the device driver is actively working. The function that carried on
the initialization phase terminates; the strategy routine and the
interrupt handler take care of each request passed to the device
driver by the I/O scheduler.
策略例程是块设备驱动程序的一个函数(或一组函数),它与硬件块设备交互以满足调度队列中收集的请求。策略例程通过请求队列描述符的request_fn方法调用——即上一节示例中的foo_strategy( )函数。I/O 调度程序层将请求队列描述符的地址q传递给该函数。
The strategy routine is a function—or a group of
functions—of the block device driver that interacts with the hardware
block device to satisfy the requests collected in the dispatch queue.
The strategy routine is invoked by means of the request_fn method of the request queue
descriptor—the foo _strategy( ) function in the example carried
on in the previous section. The I/O scheduler layer passes to this
function the address q of the
request queue descriptor.
正如我们将看到的,策略例程通常在将新请求插入空请求队列后启动。一旦激活,块设备驱动程序应该处理队列中的所有请求,并在队列为空时终止。
As we'll see, the strategy routine is usually started after inserting a new request in an empty request queue. Once activated, the block device driver should handle all requests in the queue and terminate when the queue is empty.
策略例程的简单实现可能如下:对于调度队列中的每个元素,将其从队列中删除,与块设备控制器交互以服务请求,并等待数据传输完成。然后继续处理调度队列中的下一个请求。
A naïve implementation of the strategy routine could be the following: for each element in the dispatch queue, remove it from the queue, interact with the block device controller to service the request, and wait until the data transfer completes. Then proceed with the next request in the dispatch queue.
这样的实现效率不是很高。即使假设可以使用 DMA 传输数据,策略例程也必须在等待 I/O 完成时挂起自身。这意味着策略例程应该在专用的内核线程上执行(我们不想惩罚不相关的用户进程,不是吗?)。此外,这样的驱动程序将无法支持可同时处理多个 I/O 数据传输的现代磁盘控制器。
Such an implementation is not very efficient. Even assuming that the data can be transferred using DMA, the strategy routine must suspend itself while waiting for I/O completion. This means that the strategy routine should execute on a dedicated kernel thread (we don't want to penalize an unrelated user process, do we?). Moreover, such a driver would not be able to support modern disk controllers that can process multiple I/O data transfers at a time.
因此,大多数块设备驱动程序采用以下策略:
Therefore, most block device drivers adopt the following strategy:
策略例程为队列中的第一个请求启动数据传输,并设置块设备控制器,以便在数据传输完成时引发中断。然后策略例程终止。
The strategy routine starts a data transfer for the first request in the queue and sets up the block device controller so that it raises an interrupt when the data transfer completes. Then the strategy routine terminates.
当磁盘控制器引发中断时,中断处理程序再次调用策略例程(通常直接调用,有时通过激活工作队列)。策略例程要么为当前请求启动另一次数据传输,要么如果请求的所有数据块都已传输,则从调度队列中删除该请求并开始处理下一个请求。
When the disk controller raises the interrupt, the interrupt handler invokes the strategy routine again (often directly, sometimes by activating a work queue). The strategy routine either starts another data transfer for the current request or, if all the chunks of data of the request have been transferred, removes the request from the dispatch queue and starts processing the next request.
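The two-step strategy above can be sketched as a small user-space model (all `sim_*` names are hypothetical, invented for illustration; they are not kernel APIs): the strategy routine only programs one transfer and returns, while the simulated interrupt handler either continues the current request or dequeues it and restarts the strategy routine.

```c
#include <assert.h>

/* Hypothetical user-space model of the interrupt-driven strategy:
 * each request needs `chunks_left` data transfers; the strategy
 * routine only starts one transfer, the (simulated) interrupt
 * handler drives the rest. */
struct sim_request { int chunks_left; };

struct sim_queue {
    struct sim_request *reqs;
    int head, count;
    int transfers_started;   /* how many DMA transfers were programmed */
};

/* Strategy routine: start one transfer for the first request, then return. */
static void sim_strategy(struct sim_queue *q)
{
    if (q->head >= q->count)
        return;              /* dispatch queue empty */
    q->transfers_started++;  /* program controller, enable IRQ */
}

/* Interrupt handler: one chunk completed; continue or move on. */
static void sim_interrupt(struct sim_queue *q)
{
    struct sim_request *req = &q->reqs[q->head];
    if (--req->chunks_left > 0) {
        q->transfers_started++;      /* next chunk of the same request */
    } else {
        q->head++;                   /* dequeue the finished request */
        sim_strategy(q);             /* start the next request, if any */
    }
}

/* Drive the model until the queue drains; returns total transfers. */
static int sim_run(struct sim_queue *q)
{
    sim_strategy(q);
    while (q->head < q->count)
        sim_interrupt(q);
    return q->transfers_started;
}
```

With two queued requests needing 2 and 3 chunks, the model programs five transfers in total, and the strategy routine itself never blocks waiting for I/O.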
请求可以由多个 bio 组成,而 bio 又可以由多个段组成。基本上,块设备驱动程序通过两种方式使用 DMA:
Requests can be composed of several bios, which in turn can be composed of several segments. Basically, block device drivers make use of DMA in two ways:
驱动程序设置不同的 DMA 传输来服务请求的每个 bio 中的每个段
The driver sets up a different DMA transfer to service each segment in each bio of the request
驱动程序设置单个分散-聚集 DMA 传输来为请求的所有 bio 中的所有段提供服务
The driver sets up a single scatter-gather DMA transfer to service all segments in all bios of the request
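As a rough illustration of the second approach, the sketch below (hypothetical `sim_*` types, not the kernel's bio structures) flattens every segment of every bio of a request into one array, which is essentially what a scatter-gather list is:

```c
#include <assert.h>

/* Hypothetical model: a request holds bios, each bio holds segments.
 * Building a scatter-gather list amounts to collecting every segment
 * of every bio, in order, into one array (conceptually what
 * blk_rq_map_sg() does for real requests). */
#define MAX_SG 16

struct sim_segment { unsigned long addr; unsigned int len; };
struct sim_bio { struct sim_segment segs[4]; int nsegs; };
struct sim_req { struct sim_bio bios[4]; int nbios; };

static int sim_map_sg(const struct sim_req *req, struct sim_segment *sg)
{
    int n = 0;
    for (int b = 0; b < req->nbios; b++)
        for (int s = 0; s < req->bios[b].nsegs; s++)
            sg[n++] = req->bios[b].segs[s];    /* one entry per segment */
    return n;   /* number of scatter-gather entries */
}
```

A controller that accepts such a list can then serve the whole request with a single programmed transfer, instead of one transfer per segment.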
最终,设备驱动程序的策略例程的设计取决于块控制器的特性。每个物理块设备本质上都与其他所有块设备不同(例如,软盘驱动程序将磁盘磁道中的块分组并在单个 I/O 操作中传输整个磁道),因此对设备驱动程序应如何服务请求做出一般性假设是无意义的。
Ultimately, the design of the strategy routine of the device drivers depends on the characteristics of the block controller. Each physical block device is inherently different from all others (for example, a floppy driver groups blocks in disk tracks and transfers a whole track in a single I/O operation), so making general assumptions on how a device driver should service a request is meaningless.
在我们的示例中,foo_strategy( )策略例程可以执行以下操作:
In our example, the foo _strategy( ) strategy routine could execute
the following actions:
通过调用elv_next_request( ) I/O 调度程序辅助函数从调度队列获取当前请求。如果调度队列为空,则策略例程返回:
req = elv_next_request(q); if (!req) return;
Gets the current request from the dispatch queue by invoking
the elv_next_request( ) I/O
scheduler helper function. If the dispatch queue is empty, the
strategy routine returns:
req = elv_next_request(q); if (!req) return;
执行blk_fs_request宏来检查请求的REQ_CMD标志是否被设置,即请求是否包含正常的读或写操作:
if (!blk_fs_request(req))
goto handle_special_request;
Executes the blk_fs_request macro to check whether
the REQ_CMD flag of the request
is set, that is, whether the request contains a normal read or
write operation:
if (!blk_fs_request(req))
goto handle_special_request;
如果块设备控制器支持分散-聚集 DMA,它会对磁盘控制器进行编程,以便执行整个请求的数据传输,并在传输完成时引发中断。blk_rq_map_sg( )辅助函数返回一个分散-聚集列表,可立即用于启动传输。
If the block device controller supports scatter-gather DMA,
it programs the disk controller so as to perform the data transfer
for the whole request and to raise an interrupt when the transfer
completes. The blk_rq_map_sg( )
helper function returns a scatter-gather list that can be
immediately used to start the transfer.
否则,设备驱动程序必须逐段传输数据。在这种情况下,策略例程执行rq_for_each_bio和bio_for_each_segment宏,它们分别遍历 bio 列表和每个 bio 内的段列表:
rq_for_each_bio(bio, rq)
bio_for_each_segment(bvec, bio, i) {
/* transfer the i-th segment bvec */
local_irq_save(flags);
addr = kmap_atomic(bvec->bv_page, KM_BIO_SRC_IRQ);
foo_start_dma_transfer(addr+bvec->bv_offset, bvec->bv_len);
kunmap_atomic(bvec->bv_page, KM_BIO_SRC_IRQ);
local_irq_restore(flags);
}
如果要传输的数据可能位于高端内存,则需要kmap_atomic( )和kunmap_atomic( )函数。foo_start_dma_transfer( )函数对硬件设备进行编程,以便启动 DMA 传输并在 I/O 操作完成时引发中断。
Otherwise, the device driver must transfer the data segment
by segment. In this case, the strategy routine executes the
rq_for_each_bio and bio_for_each_segment macros, which walk
the list of bios and the list of segments inside each bio,
respectively:
rq_for_each_bio(bio, rq)
bio_for_each_segment(bvec, bio, i) {
/* transfer the i-th segment bvec */
local_irq_save(flags);
addr = kmap_atomic(bvec->bv_page, KM_BIO_SRC_IRQ);
foo_start_dma_transfer(addr+bvec->bv_offset, bvec->bv_len);
kunmap_atomic(bvec->bv_page, KM_BIO_SRC_IRQ);
local_irq_restore(flags);
}
The kmap_atomic( ) and
kunmap_atomic( ) functions are
required if the data to be transferred can be in high memory. The
foo _start_dma_transfer( ) function programs
the hardware device so as to start the DMA transfer and to raise
an interrupt when the I/O operation completes.
返回。
Returns.
当 DMA 传输终止时,块设备驱动程序的中断处理程序被激活。它应该检查请求中的所有数据块是否都已传输。如果是,则中断处理程序调用策略例程来处理调度队列中的下一个请求。否则,中断处理程序更新请求描述符的字段并调用策略例程来处理尚未执行的数据传输。
The interrupt handler of a block device driver is activated when a DMA transfer terminates. It should check whether all chunks of data in the request have been transferred. If so, the interrupt handler invokes the strategy routine to process the next request in the dispatch queue. Otherwise, the interrupt handler updates the field of the request descriptor and invokes the strategy routine to process the data transfer yet to be performed.
我们的设备驱动程序的中断处理程序的典型片段
foo如下:
A typical fragment of the interrupt handler of our
foo device driver is the following:
irqreturn_t foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
struct foo_dev_t *p = (struct foo_dev_t *) dev_id;
struct request_queue *rq = p->gd->rq;
[...]
if (!end_that_request_first(rq, uptodate, nr_sectors)) {
blkdev_dequeue_request(rq);
end_that_request_last(rq);
}
rq->request_fn(rq);
[...]
return IRQ_HANDLED;
}
irqreturn_t foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
struct foo_dev_t *p = (struct foo_dev_t *) dev_id;
struct request_queue *rq = p->gd->rq;
[...]
if (!end_that_request_first(rq, uptodate, nr_sectors)) {
blkdev_dequeue_request(rq);
end_that_request_last(rq);
}
rq->request_fn(rq);
[...]
return IRQ_HANDLED;
}
结束请求的工作分为两个函数,分别称为
end_that_request_first( )和
end_that_request_last( )。
The job of ending a request is split in two functions called
end_that_request_first( ) and
end_that_request_last( ).
end_that_request_first( )函数接收请求描述符、指示 DMA 数据传输是否成功完成的标志以及 DMA 传输中传输的扇区数作为参数(end_that_request_chunk( )函数与之类似,但它接收传输的字节数而不是扇区数)。本质上,该函数扫描请求中的 bio 以及每个 bio 内的段,然后按如下方式更新请求描述符的字段:
The end_that_request_first( )
function receives as arguments a request descriptor, a flag indicating
if the DMA data transfer completed successfully, and the number of
sectors transferred in the DMA transfer (the end_that_request_chunk( ) function is
similar, but it receives the number of bytes transferred instead of
the number of sectors). Essentially, the function scans the bios in
the request and the segments inside each bio, then updates the fields
of the request descriptor in such a way to:
设置bio字段,使其指向请求中第一个未完成的 bio。
Set the bio field so that
it points to the first unfinished bio in the request.
设置未完成 bio 的bi_idx字段,使其指向第一个未完成的段。
Set the bi_idx of the
unfinished bio so that it points to the first unfinished
segment.
设置未完成段的bv_offset和
bv_len字段,以便它们指定尚未传输的数据。
Set the bv_offset and
bv_len fields of the unfinished
segment so that they specify the data yet to be
transferred.
该函数还对每个已完全传输的 bio 调用bio_endio( )。
The function also invokes bio_endio(
) on each bio that has been completely transferred.
如果请求中的所有数据块均已传输,end_that_request_first( )函数返回 0;否则返回 1。如果返回值为 1,中断处理程序将重新启动策略例程,从而继续处理同一请求。否则,中断处理程序从请求队列中删除该请求(通常使用blkdev_dequeue_request( )),调用end_that_request_last( )辅助函数,并重新启动策略例程以处理调度队列中的下一个请求。
The end_that_request_first( )
function returns 0 if all chunks of data in the request have been
transferred; otherwise, it returns 1. If the returned value is 1, the
interrupt handler restarts the strategy routine, which thus continues
processing the same request. Otherwise, the interrupt handler removes
the request from the request queue (typically by using blkdev_dequeue_request( )), invokes the
end_that_request_last( ) helper
function, and restarts the strategy routine to process the next
request in the dispatch queue.
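The 0/1 convention and the resulting handler logic can be modeled in a few lines of user-space C (the `sim_*` names are invented for illustration and stand in for the real kernel helpers):

```c
#include <assert.h>

/* Hypothetical model of the completion protocol: a "first" helper
 * consumes transferred sectors and returns 1 while work remains,
 * 0 once the whole request is done, mirroring the return-value
 * convention of end_that_request_first() described above. */
struct sim_rq { unsigned int sectors_left; int dequeued; };

static int sim_end_first(struct sim_rq *rq, unsigned int nr_sectors)
{
    if (nr_sectors >= rq->sectors_left) {
        rq->sectors_left = 0;
        return 0;          /* request fully transferred */
    }
    rq->sectors_left -= nr_sectors;
    return 1;              /* more chunks to go */
}

static void sim_end_last(struct sim_rq *rq)
{
    rq->dequeued = 1;      /* dequeue + release the descriptor */
}

/* What the interrupt handler does with the return value. */
static void sim_handler(struct sim_rq *rq, unsigned int nr_sectors)
{
    if (!sim_end_first(rq, nr_sectors))
        sim_end_last(rq);
    /* in both cases the strategy routine would be restarted here */
}
```

A request of 8 sectors completed in two 4-sector DMA transfers stays queued after the first interrupt and is dequeued only by the second.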
end_that_request_last( )函数更新一些磁盘使用统计信息,从rq->elevator I/O 调度程序的调度队列中删除请求描述符,唤醒在请求描述符的waiting完成量上休眠的任何进程,并释放该描述符。
The end_that_request_last( )
function updates some disk usage statistics, removes the request
descriptor from the dispatch queue of the rq->elevator I/O scheduler, wakes up any
process sleeping in the waiting
completion of the request descriptor, and releases that
descriptor.
[ * ]块设备方法不应与块设备文件操作混淆;请参阅本章后面的“打开块设备文件”部分。
[*] The block device methods should not be confused with the block device file operations; see the section "Opening a Block Device File" later in this chapter.
我们通过描述 VFS 在打开块设备文件时执行的步骤来结束本章。
We conclude this chapter by describing the steps performed by the VFS when opening a block device file.
每次在磁盘或分区上挂载文件系统、每次激活交换分区以及每次用户模式进程在块设备文件上发出open( )系统调用时,内核都会打开一个块设备文件。在所有情况下,内核执行的操作本质上相同:它查找块设备描述符(如果块设备尚未使用,则可能分配新的描述符),并为即将进行的数据传输设置文件操作方法。
The kernel opens a block device file every time that a filesystem
is mounted over a disk or partition, every time that a swap partition is
activated, and every time that a User Mode process issues an open( ) system call on a block device file. In all cases, the
kernel executes essentially the same operations: it looks for the block
device descriptor (possibly allocating a new descriptor if the block
device is not already in use), and sets up the file operation methods
for the forthcoming data transfers.
在第 13 章“设备文件的 VFS 处理”一节中,我们描述了dentry_open( )函数如何在打开设备文件时自定义文件对象的方法。此时,文件对象的f_op字段被设置为def_blk_fops表的地址,该表的内容如表 14-10 所示。
In the section "VFS
Handling of Device Files" in Chapter 13, we described how the
dentry_open( ) function customizes
the methods of the file object when a device file is opened. In this
case, the f_op field of the file
object is set to the address of the def_blk_fops table, whose content is shown in
Table 14-10.
表 14-10。默认块设备文件操作(def_blk_fops表)
Table 14-10. The default block device file operations (def_blk_fops table)
方法 Method | 功能 Function |
|---|---|
这里我们只关心open方法,它由dentry_open( )函数调用。blkdev_open( )函数接收inode和filp作为其参数,它们分别存储 inode 对象和文件对象的地址;该函数主要执行以下步骤:
Here we are only concerned with the open method, which is invoked by the dentry_open( ) function. The blkdev_open( ) function receives as its
parameters inode and filp, which store the addresses of the inode
and file objects respectively; the function essentially executes the
following steps:
执行bd_acquire(inode)以获取块设备描述符的地址bdev。该函数接收 inode 对象的地址,并执行以下步骤:
检查 inode 对象的inode->i_bdev字段是否为NULL;如果不是,则块设备文件已经打开,该字段存储着相应块设备描述符的地址。在这种情况下,该函数增加与块设备关联的bdev特殊文件系统的inode->i_bdev->bd_inode inode 的使用计数器,并返回描述符的地址inode->i_bdev。
这里块设备文件还没有被打开。执行bdget(inode->i_rdev)以获取与块设备文件的主设备号和次设备号相对应的块设备描述符的地址(请参阅本章前面的“块设备”部分)。如果描述符尚不存在,bdget( )会分配它;但请注意,描述符可能已经存在,例如因为块设备已通过另一个块设备文件被访问。
将块设备描述符地址存储在inode->i_bdev中,以加快以后对同一块设备文件的打开操作。
使用bdev inode 中相应字段的值设置inode->i_mapping字段。这是指向地址空间对象的指针,将在第 15 章的“address_space对象”部分中解释。
将inode插入到以bdev->bd_inodes为根的块设备描述符的已打开 inode 列表中。
返回描述符的地址bdev。
Executes bd_acquire(inode )
to get the address bdev
of the block device descriptor. In turn, this
function receives the inode object address and performs the
following steps:
Checks whether the inode->i_bdev field of the inode
object is not NULL; if it is,
the block device file has been opened already, and this field
stores the address of the corresponding block descriptor. In
this case, the function increases the usage counter of the
inode->i_bdev->bd_inode
inode of the bdev special filesystem
associated with the block device, and returns the address
inode->i_bdev of the
descriptor.
Here the block device file has not been opened yet.
Executes bdget(inode->i_rdev) to get the
address of the block device descriptor corresponding to the
major and minor number of the block device file (see the section
"Block
Devices" earlier in this chapter). If the descriptor does
not already exist, bdget( )
allocates it; notice however that the descriptor might already
exist, for instance because the block device is already being
accessed by means of another block device file.
Stores the block device descriptor address in inode->i_bdev, so as to speed up
future opening operations on the same block device file.
Sets the inode->i_mapping field with the
value of the corresponding field in the bdev inode. This is the pointer to the
address space object, which will be explained in the section
"The address_space
Object" in Chapter
15.
Inserts inode into the
list of opened inodes of the block device descriptor rooted at
bdev->bd_inodes.
Returns the address bdev of the descriptor.
将filp->i_mapping字段设置为inode->i_mapping的值(请参阅上面的步骤 1(d))。
Sets the filp->i_mapping
field with the value of inode->i_mapping (see step 1(d)
above).
获取相对于该块设备的gendisk描述符的地址:
disk = get_gendisk(bdev->bd_dev, &part);
如果正在打开的块设备是分区,该函数还会在part局部变量中返回其索引;否则,part被设置为零。get_gendisk( )函数只是在bdev_map kobject 映射域上调用kobj_lookup( ),并传递设备的主设备号和次设备号(另请参阅本章前面的“设备驱动程序注册和初始化”部分)。
Gets the address of the gendisk descriptor relative to this block
device:
disk = get_gendisk(bdev->bd_dev, &part);
If the block device being opened is a partition, the function
also returns its index in the part local variable; otherwise, part is set to zero. The get_gendisk( ) function simply invokes
kobj_lookup( ) on the bdev_map kobject mapping domain passing the major and minor number of the device (see
also the section "Device Driver Registration
and Initialization" earlier in this chapter).
如果bdev->bd_openers不等于 0,则块设备已被打开。检查bdev->bd_contains字段:
如果它等于bdev,则块设备是整个磁盘:调用bdev->bd_disk->fops->open块设备方法(如果已定义),然后检查bdev->bd_invalidated字段,并在必要时调用rescan_partitions( )函数(请参阅稍后对步骤 6a 和 6c 的注释)。
如果它不等于bdev,则块设备是分区:增加bdev->bd_contains->bd_part_count计数器。
然后,跳至步骤 8。
If bdev->bd_openers is
not equal to zero, the block device has already been opened. Checks
the bdev->bd_contains
field:
If it is equal to bdev,
the block device is a whole disk: invokes the bdev->bd_disk->fops->open
block device method, if defined, then checks the bdev->bd_invalidated field and
invokes, if necessary, the rescan_partitions( ) functions (see
comments on steps 6a and 6c later).
If it is not equal to bdev, the block device is a partition:
increases the bdev->bd_contains->bd_part_count
counter.
Then, jumps to step 8.
这里是第一次访问块设备。使用gendisk描述符的地址disk初始化bdev->bd_disk。
Here the block device is being accessed for the first time.
Initializes bdev->bd_disk with
the address disk of the gendisk descriptor.
如果块设备是整个磁盘(part为零),则执行以下子步骤:
如果定义,它将执行disk->fops->open块设备方法:它是由块设备驱动程序定义的自定义函数,用于执行任何特定的最后一刻初始化。
从disk->queue请求队列的hardsect_size字段获取扇区大小(以字节为单位),并使用该值正确设置bdev->bd_block_size和bdev->bd_inode->i_blkbits字段。还将bdev->bd_inode->i_size字段设置为根据disk->capacity计算出的磁盘大小。
如果设置了bdev->bd_invalidated标志,它将调用rescan_partitions( )扫描分区表并更新分区描述符。该标志是通过check_disk_change块设备方法设置的,该方法仅适用于可移动设备。
If the block device is a whole disk (part is zero), it executes the following
substeps:
If defined, it executes the disk->fops->open block device
method: it is a custom function defined by the block device
driver to perform any specific last-minute
initialization.
Gets from the hardsect_size field of the disk->queue request queue the
sector size in bytes, and uses this value to set properly the
bdev->bd_block_size and
bdev->bd_inode->i_blkbits
fields. Sets also the bdev->bd_inode->i_size field
with the size of the disk computed from disk->capacity.
If the bdev->bd_invalidated flag is set,
it invokes rescan_partitions(
) to scan the partition table and update the partition
descriptors. The flag is set by the check_disk_change block device method,
which applies only to removable devices.
否则,如果块设备是分区(part不为零),则执行以下子步骤:
再次调用bdget( )(这次传递disk->first_minor次设备号)以获取整个磁盘的块设备描述符的地址whole。
对整个磁盘的块设备描述符重复步骤 3 到 6,从而在必要时对其进行初始化。
设置bdev->bd_contains为整个磁盘的描述符的地址。
增加whole->bd_part_count以考虑磁盘分区上的新打开操作。
将bdev->bd_part设置为disk->part[part-1]中的值;它是分区的hd_struct描述符的地址。另外,执行kobject_get(&bdev->bd_part->kobj)以增加分区的引用计数器。
如步骤 6b 中所示,设置指定分区大小和扇区大小的索引节点字段。
Otherwise if the block device is a partition (part is not zero), it executes the
following substeps:
Invokes bdget( )
again—this time passing the disk->first_minor minor number—to
get the address whole of the
block descriptor for the whole disk.
Repeats steps from 3 to 6 for the block device descriptor of the whole disk, thus initializing it if necessary.
Sets bdev->bd_contains to the address of
the descriptor of the whole disk.
Increases whole->bd_part_count to account for
the new open operation on the partition of the disk.
Sets bdev->bd_part
with the value in disk->part[part-1]; it is the
address of the hd_struct
descriptor of the partition. Also, executes kobject_get(&bdev->bd_part->kobj)
to increase the reference counter of the partition.
As in step 6b, sets the inode fields that specify size and sector size of the partition.
增加bdev->bd_openers计数器。
Increases the bdev->bd_openers counter.
如果块设备文件以独占模式打开(filp->f_flags中设置了O_EXCL标志),它将调用bd_claim(bdev, filp)来设置块设备的持有者(请参阅本章前面的“块设备”部分)。如果发生错误——块设备已经有一个持有者——它会释放块设备描述符并返回错误代码-EBUSY。
If the block device file is being opened in exclusive mode
(O_EXCL flag in filp->f_flags set), it invokes bd_claim(bdev, filp) to set the holder of
the block device (see the section "Block Devices" earlier
in this chapter). In case of error—the block device already has a
holder—it releases the block device descriptor and returns the error
code -EBUSY.
通过返回 0(成功)来终止。
Terminates by returning 0 (success).
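A minimal user-space sketch of the descriptor caching performed in step 1 (hypothetical `sim_*` names; the real lookup goes through the bdev special filesystem and `bdget( )`): the first open of a device allocates a descriptor and caches it in the inode, while later opens reuse the cached pointer and only bump the openers counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of bd_acquire()/blkdev_open() descriptor reuse. */
struct sim_bdev { int bd_openers; };
struct sim_inode { struct sim_bdev *i_bdev; };

static struct sim_bdev pool[8];
static int pool_used;                /* how many descriptors were allocated */

static struct sim_bdev *sim_bdget(void)
{
    return &pool[pool_used++];       /* "allocate" a new descriptor */
}

static struct sim_bdev *sim_blkdev_open(struct sim_inode *inode)
{
    if (!inode->i_bdev)
        inode->i_bdev = sim_bdget(); /* first access: allocate and cache */
    inode->i_bdev->bd_openers++;     /* every open bumps the counter */
    return inode->i_bdev;
}
```

Opening the same device file twice yields the same descriptor with an openers count of two, and only one allocation.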
一旦blkdev_open( )
函数终止,open( )
系统调用将照常进行。未来对打开的文件发出的每个系统调用都将触发默认块设备文件操作之一。正如我们将在第 16 章中看到的,每次传入或传出块设备的数据传输都是通过向通用块层提交请求来有效实现的。
Once the blkdev_open( )
function terminates, the open( )
system call proceeds as usual. Every future system call issued on the
opened file will trigger one of the default block device file
operations. As we will see in Chapter 16, each data transfer to
or from the block device is effectively implemented by submitting
requests to the generic block layer.
正如第 12 章“通用文件模型”一节中已经提到的,磁盘缓存是一种软件机制,允许系统将通常存储在磁盘上的一些数据保留在 RAM 中,以便对该数据的后续访问无需访问磁盘即可快速得到满足。
As already mentioned in the section "The Common File Model" in Chapter 12, a disk cache is a software mechanism that allows the system to keep in RAM some data that is normally stored on a disk, so that further accesses to that data can be satisfied quickly without accessing the disk.
磁盘缓存对于系统性能至关重要,因为重复访问相同的磁盘数据是很常见的。与磁盘交互的用户态进程有权重复请求读取或写入相同的磁盘数据。而且,不同的进程还可能需要在不同的时间寻址相同的磁盘数据。例如,您可以使用 cp命令复制文本文件,然后调用您喜欢的编辑器来修改它。为了满足您的请求,命令 shell 将创建两个不同的进程,在不同的时间访问同一文件。
Disk caches are crucial for system performance, because repeated accesses to the same disk data are quite common. A User Mode process that interacts with a disk is entitled to ask repeatedly to read or write the same disk data. Moreover, different processes may also need to address the same disk data at different times. As an example, you may use the cp command to copy a text file and then invoke your favorite editor to modify it. To satisfy your requests, the command shell will create two different processes that access the same file at different times.
在第 12 章中我们已经遇到过其他磁盘缓存:dentry 缓存,它存储表示文件系统路径名的 dentry 对象;以及 inode 缓存,它存储表示磁盘 inode 的 inode 对象。但请注意,dentry 对象和 inode 对象并不仅仅是存储某些磁盘块内容的缓冲区;因此,dentry 缓存和 inode 缓存作为磁盘缓存来说是相当特殊的。
We have already encountered other disk caches in Chapter 12: the dentry cache , which stores dentry objects representing filesystem pathnames, and the inode cache , which stores inode objects representing disk inodes. Notice, however, that dentry objects and inode objects are not mere buffers storing the contents of some disk blocks; thus, the dentry cache and the inode cache are rather peculiar as disk caches.
本章涉及页面缓存,这是一种作用于整页数据的磁盘缓存。我们在第一节介绍页面缓存。然后,我们在“在页面缓存中存储块”一节讨论如何使用页面缓存来检索单个数据块(例如,超级块和索引节点);此功能对于加速 VFS 和基于磁盘的文件系统至关重要。接下来,我们在“将脏页写入磁盘”一节中描述如何将页面缓存中的脏页写回磁盘。最后,我们在最后一节“sync( )、fsync( )和fdatasync( )系统调用”中介绍一些允许用户刷新页面缓存内容以更新磁盘内容的系统调用。
This chapter deals with the page cache , which is a disk cache working on whole pages of data. We introduce the page cache in the first section. Then, we discuss in the section "Storing Blocks in the Page Cache" how the page cache can be used to retrieve single blocks of data (for instance, superblocks and inodes); this feature is crucial to speed up the VFS and the disk-based filesystems. Next, we describe in the section "Writing Dirty Pages to Disk" how the dirty pages in the page cache are written back to disk. Finally, we mention in the last section "The sync( ), fsync( ), and fdatasync( ) System Calls" some system calls that allow a user to flush the contents of the page cache so as to update the disk contents.
页缓存是 Linux 内核使用的主要磁盘缓存。在大多数情况下,内核在读取或写入磁盘时会引用页面缓存。新页面被添加到页面缓存中以满足用户模式进程的读取请求。如果该页尚未在缓存中,则会将新条目添加到缓存中并填充从磁盘读取的数据。如果有足够的可用内存,该页面将无限期地保留在缓存中,然后可以由其他进程重用,而无需访问磁盘。
The page cache is the main disk cache used by the Linux kernel. In most cases, the kernel refers to the page cache when reading from or writing to disk. New pages are added to the page cache to satisfy User Mode processes's read requests. If the page is not already in the cache, a new entry is added to the cache and filled with the data read from the disk. If there is enough free memory, the page is kept in the cache for an indefinite period of time and can then be reused by other processes without accessing the disk.
同样,在将一页数据写入块设备之前,内核会验证相应的页是否已经包含在缓存中;如果没有,则将一个新条目添加到缓存中并填充要写入磁盘的数据。I/O数据传输不会立即开始:磁盘更新会延迟几秒钟,从而给进程进一步修改要写入的数据的机会(换句话说,内核实现了延迟写操作)。
Similarly, before writing a page of data to a block device, the kernel verifies whether the corresponding page is already included in the cache; if not, a new entry is added to the cache and filled with the data to be written on disk. The I/O data transfer does not start immediately: the disk update is delayed for a few seconds, thus giving a chance to the processes to further modify the data to be written (in other words, the kernel implements deferred write operations).
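The effect of deferred writes can be illustrated with a toy model (all `sim_*` names are hypothetical): repeated writes dirty a cached page without touching the device, and only the delayed flush performs a single real write-back, no matter how many times the page was modified in between.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of deferred write operations on a cached page. */
struct sim_page { char data[16]; int dirty; };

static int device_writes;           /* how many real I/O writes happened */

static void sim_write(struct sim_page *p, const char *buf)
{
    strcpy(p->data, buf);           /* modify the cached copy only */
    p->dirty = 1;                   /* deferred: no I/O yet */
}

static void sim_flush(struct sim_page *p)
{
    if (p->dirty) {
        device_writes++;            /* single real write-back */
        p->dirty = 0;
    }
}
```

Two successive writes cause zero device writes until the flush runs; a second flush of a clean page does nothing.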
内核代码和内核数据结构不需要从磁盘读取或写入磁盘。[ * ]因此,页面缓存中包含的页面可以是以下类型:
Kernel code and kernel data structures don't need to be read from or written to disk.[*] Thus, the pages included in the page cache can be of the following types:
包含常规文件数据的页面;在第16章中,我们描述了内核如何处理它们的读、写和内存映射操作。
Pages containing data of regular files; in Chapter 16, we describe how the kernel handles read, write, and memory mapping operations on them.
包含目录的页面;正如我们将在第 18 章中看到的,Linux 处理目录的方式与处理常规文件的方式非常相似。
Pages containing directories; as we'll see in Chapter 18, Linux handles the directories much like regular files.
包含直接从块设备文件读取的数据的页面(跳过文件系统层);正如第 16 章中所讨论的,内核使用与包含常规文件数据的页面相同的一组函数来处理它们。
Pages containing data directly read from block device files (skipping the filesystem layer); as discussed in Chapter 16, the kernel handles them using the same set of functions as for pages containing data of regular files.
包含已在磁盘上换出的用户模式进程数据的页面。正如我们将在第 17 章中看到的,内核可能被迫在页面缓存中保留一些其内容已经写入交换区域(常规文件或磁盘分区)的页面。
Pages containing data of User Mode processes that have been swapped out on disk. As we'll see in Chapter 17, the kernel could be forced to keep in the page cache some pages whose contents have been already written on a swap area (either a regular file or a disk partition).
属于特殊文件系统(例如 shm)的文件的页面 用于进程间通信(IPC)共享内存区域的特殊文件系统(参见第 19 章)。
Pages belonging to files of special filesystems, such as the shm special filesystem used for Interprocess Communication (IPC) shared memory region (see Chapter 19).
正如您所看到的,页面缓存中包含的每个页面都包含属于某个文件的数据。该文件(或更准确地说是文件的索引节点)称为页面的所有者。(正如我们将在第 17 章中看到的,包含换出数据的页面具有相同的所有者,即使它们引用不同的交换区域。)
As you can see, each page included in the page cache contains data belonging to some file. This file—or more precisely the file's inode—is called the page's owner. (As we will see in Chapter 17, pages containing swapped-out data have the same owner even if they refer to different swap areas.)
实际上,所有read( )和write( )文件操作都依赖于页面缓存。唯一的例外发生在进程打开设置了O_DIRECT标志的文件时:在这种情况下,页面缓存被绕过,I/O 数据传输利用进程用户模式地址空间中的缓冲区(请参阅第 16 章的“直接 I/O 传输”一节);一些数据库应用程序使用O_DIRECT标志,以便可以使用自己的磁盘缓存算法。
Practically all read( ) and
write( ) file operations rely on the
page cache. The only exception occurs when a process opens a file with
the O_DIRECT flag set: in this case,
the page cache is bypassed and the I/O data transfers make use of
buffers in the User Mode address space of the process (see the section
"Direct I/O Transfers"
in Chapter 16); several
database applications make use of the O_DIRECT flag so that they can use their own
disk caching algorithm.
内核设计者实现页面缓存是为了满足两个主要要求:
Kernel designers have implemented the page cache to fulfill two main requirements:
快速定位包含与给定所有者相关的数据的特定页面。为了最大限度地利用页面缓存,搜索它应该是一个非常快的操作。
Quickly locate a specific page containing data relative to a given owner. To take the maximum advantage from the page cache, searching it should be a very fast operation.
跟踪在读取或写入其内容时应如何处理缓存中的每个页面。例如,从常规文件、块设备文件或交换区域读取页面必须以不同的方式执行,因此内核必须根据页面的所有者选择正确的操作。
Keep track of how every page in the cache should be handled when reading or writing its content. For instance, reading a page from a regular file, a block device file, or a swap area must be performed in different ways, thus the kernel must select the proper operation depending on the page's owner.
页缓存中保存的信息单位当然是整页数据。正如我们将在第 18 章中看到的,页面不一定包含物理上相邻的磁盘块,因此它不能通过设备号和块号来标识。相反,页面缓存中的页面由所有者和所有者数据中的索引(通常是相应文件内的索引节点和偏移量)来标识。
The unit of information kept in the page cache is, of course, a whole page of data. As we'll see in Chapter 18, a page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by a device number and a block number. Instead, a page in the page cache is identified by an owner and by an index within the owner's data—usually, an inode and an offset inside the corresponding file.
页缓存的核心数据结构是address_space对象,它是嵌入在拥有该页的 inode 对象中的数据结构。[*]缓存中的许多页面可能引用同一所有者,因此它们可能链接到同一address_space对象。该对象还在所有者的页面和在这些页面上操作的一组方法之间建立链接。
The core data structure of the page cache is the
address_space object, a data
structure embedded in the inode object that owns the page.[*] Many pages in the cache may refer to the same owner,
thus they may be linked to the same address_space object. This object also
establishes a link between the owner's pages and a set of methods that
operate on these pages.
每个页面描述符都包含两个称为mapping和index的字段,它们将页面链接到页面缓存(请参阅第 8 章中的“页面描述符”部分)。第一个字段指向拥有该页面的 inode 的address_space对象。第二个字段指定所有者“地址空间”内以页面大小为单位的偏移量,即页面数据在所有者磁盘映像内的位置。在页面缓存中查找页面时将使用这两个字段。
Each page descriptor includes two fields called mapping and index, which link the page to the page cache
(see the section "Page
Descriptors" in Chapter
8). The first field points to the address_space object of the inode that owns
the page. The second field specifies the offset in page-size units
within the owner's "address space," that is, the position of the
page's data inside the owner's disk image. These two fields are used
when looking for a page in the page cache.
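A toy lookup keyed by the (owner, index) pair shows the idea (hypothetical `sim_*` names; the kernel uses a per-address_space radix tree rather than a linear scan):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: pages are found in the cache by the pair
 * (owner's address_space, index), not by device and block number. */
struct sim_address_space { int id; };
struct sim_page { struct sim_address_space *mapping; unsigned long index; };

static struct sim_page *sim_find_page(struct sim_page *cache, int n,
                                      struct sim_address_space *as,
                                      unsigned long index)
{
    for (int i = 0; i < n; i++)
        if (cache[i].mapping == as && cache[i].index == index)
            return &cache[i];      /* cache hit */
    return NULL;                   /* cache miss */
}
```

Two owners can cache pages at the same index without ambiguity, since the owner pointer is part of the key.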
令人惊讶的是,页面缓存可能很乐意包含相同磁盘数据的多个副本。例如,可以通过以下方式访问常规文件的相同 4 KB 数据块:
Quite surprisingly, the page cache may happily contain multiple copies of the same disk data. For instance, the same 4-KB block of data of a regular file can be accessed in the following ways:
读取文件;因此,数据包含在常规文件的索引节点所拥有的页面中。
Reading the file; hence, the data is included in a page owned by the regular file's inode.
从承载该文件的设备文件(磁盘分区)中读取块;因此,数据包含在块设备文件的主索引节点所拥有的页面中。
Reading the block from the device file (disk partition) that hosts the file; hence, the data is included in a page owned by the master inode of the block device file.
因此,相同的磁盘数据出现在由两个不同address_space对象引用的两个不同页面中。
Thus, the same disk data appears in two different pages
referenced by two different address_space objects.
address_space对象的字段如表 15-1 所示。
The fields of the address_space object are shown in Table 15-1.
表 15-1。地址空间对象的字段
Table 15-1. The fields of the address_space object
类型 Type | 字段 Field | 描述 Description |
|---|---|---|
| struct inode * | host | 指向托管该对象的 inode 的指针(如果有) Pointer to the inode hosting this object, if any |
| struct radix_tree_root | page_tree | 标识所有者页面的基数树的根 Root of radix tree identifying the owner's pages |
| spinlock_t | tree_lock | 保护基数树的自旋锁 Spin lock protecting the radix tree |
| unsigned int | i_mmap_writable | 地址空间中共享内存映射的数量 Number of shared memory mappings in the address space |
| struct prio_tree_root | i_mmap | 基数优先级搜索树的根(参见第 17 章) Root of the radix priority search tree (see Chapter 17) |
| struct list_head | i_mmap_nonlinear | 地址空间中的非线性内存区域列表 List of non-linear memory regions in the address space |
| spinlock_t | i_mmap_lock | 保护基数优先级搜索树的自旋锁 Spin lock protecting the radix priority search tree |
| unsigned int | truncate_count | 截断文件时使用的序列计数器 Sequence counter used when truncating the file |
| unsigned long | nrpages | 所有者页面总数 Total number of owner's pages |
| pgoff_t | writeback_index | 所有者页面上最后一次回写操作的页面索引 Page index of the last write-back operation on the owner's pages |
| struct address_space_operations * | a_ops | 在所有者页面上操作的方法 Methods that operate on the owner's pages |
| unsigned long | flags | 错误位和内存分配器标志 Error bits and memory allocator flags |
| struct backing_dev_info * | backing_dev_info | 指向存储所有者数据的块设备的 backing_dev_info 描述符的指针 Pointer to the backing_dev_info descriptor of the block device holding the data of this owner |
| spinlock_t | private_lock | 通常,管理 private_list 时使用的自旋锁 Usually, spin lock used when managing the private_list |
| struct list_head | private_list | 通常,与 inode 关联的间接块的脏缓冲区列表 Usually, a list of dirty buffers of indirect blocks associated with the inode |
| struct address_space * | assoc_mapping | 通常,指向包含间接块的块设备的 address_space 对象的指针 Usually, pointer to the address_space object of the block device containing the indirect blocks |
如果页面缓存中页面的所有者是一个文件,则address_space对象嵌入在 VFS inode 对象的i_data字段中。inode 的i_mapping字段始终指向包含该 inode 数据的页面所有者的address_space对象。address_space对象的host字段指向嵌入该描述符的 inode 对象。
If the owner of a page in the page cache is a file, the address_space object is embedded in the
i_data field of a VFS inode object.
The i_mapping field of the inode
always points to the address_space
object of the owner of the pages containing the inode's data. The
host field of the address_space object points to the inode
object in which the descriptor is embedded.
因此,如果页面属于存储在 Ext3 文件系统中的文件,则页面的所有者是该文件的 inode,对应的address_space对象存储在 VFS inode 对象的i_data字段中。inode 的i_mapping字段指向同一个 inode 的i_data字段,而address_space对象的host字段指向同一个 inode。
Thus, if a page belongs to a file that is stored in an Ext3
filesystem , the owner of the page is the inode of the file and
the corresponding address_space
object is stored in the i_data
field of the VFS inode object. The i_mapping field of the inode points to the
i_data field of the same inode, and
the host field of the address_space object points to the same
inode.
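The circular relationship among these three pointers can be sketched with simplified C structures. The field names (i_mapping, i_data, host) follow the text; the initialization helper is purely illustrative, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified sketches of the structures described above. */
struct inode;

struct address_space {
    struct inode *host;              /* back-pointer to the owning inode */
};

struct inode {
    struct address_space *i_mapping; /* where the owner's pages live */
    struct address_space i_data;     /* embedded object, used for regular files */
};

/* Hypothetical helper: wire up a regular file's inode the way the text
 * describes -- i_mapping points to the inode's own i_data, whose host
 * field points back at the inode. */
static void init_regular_file_inode(struct inode *ino)
{
    ino->i_mapping = &ino->i_data;
    ino->i_data.host = ino;
}
```

For a regular file the pointers close a loop: `ino->i_mapping->host == ino`. In the block device case described next, i_mapping instead points at an address_space embedded in a different (master) inode, and the loop closes there.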
Sometimes, however, things are more complicated. If a page
contains data read from a block device file—that is, it stores "raw"
data of a block device—the address_space object is embedded in the
"master" inode of the file in the bdev special filesystem associated with the block device
(this inode is referenced by the bd_inode field of the block device
descriptor, see the section "Block Devices" in Chapter 14). Therefore, the
i_mapping field of an inode of a
block device file points to the address_space object embedded in the master
inode; correspondingly, the host
field of the address_space object
points to the master inode. In this way, all pages containing data
read from a block device have the same address_space object, even if they have been
accessed by referring to different block device files.
The i_mmap, i_mmap_writable, i_mmap_nonlinear, and i_mmap_lock fields refer to memory mapping
and reverse mapping. We'll discuss these topics in Chapter 16 and Chapter 17.
The backing_dev_info field
points to the backing_dev_info
descriptor associated with the block device storing the data of the
owner. As explained in the section "Request Queue
Descriptors" in Chapter
14, the backing_dev_info
structure is usually embedded in the request queue descriptor of the
block device.
The private_list field is the
head of a generic list that can be freely used by the filesystem for
its specific purposes. For example, the Ext2 filesystem makes use of this list to collect the dirty buffers of
"indirect" blocks associated with the inode (see the section "Data Blocks Addressing"
in Chapter 18). When a
flush operation forces the inode to be written to disk, the kernel
also flushes all the buffers in this list. Moreover, the Ext2
filesystem stores in the assoc_mapping field a pointer to the
address_space object of the block
device containing the indirect blocks; it also uses the assoc_mapping->private_lock spin lock to
protect the lists of indirect blocks in multiprocessor systems.
A crucial field of the address_space object is a_ops, which points to a table of type
address_space_operations containing
the methods that define how the owner's pages are handled. These
methods are shown in Table
15-2.
Table 15-2. The methods of the address_space object
The most important methods are readpage, writepage, prepare_write, and commit_write. We discuss them in Chapter 16. In most cases, the
methods link the owner inode objects with the low-level drivers that
access the physical devices. For instance, the function that
implements the readpage method for
an inode of a regular file knows how to locate the positions on the
physical disk device of the blocks corresponding to each page of the
file. In this chapter, however, we don't have to discuss the address_space methods further.
In Linux, files can have large sizes, even a few
terabytes. When accessing a large file, the page cache may become
filled with so many of the file's pages that sequentially scanning all
of them would be too time-consuming. In order to perform page cache
lookup efficiently, Linux 2.6 makes use of a large set of search
trees, one for each address_space
object.
The page_tree field of an
address_space object is the root of
a radix tree, which contains pointers to the
descriptors of the owner's pages. Given a page index denoting the
position of the page inside the owner's disk image, the kernel can
perform a very fast lookup operation in order to determine whether the
required page is already included in the page cache. When looking up
the page, the kernel interprets the index as a path inside the radix
tree and quickly reaches the position where the page descriptor is—or
should be—stored. If found, the kernel can retrieve from the radix
tree the descriptor of the page; it can also quickly determine whether
the page is dirty (i.e., to be flushed to disk) and whether an I/O
transfer for its data is currently on-going.
Each node of the radix tree can have up to 64 pointers to other
nodes or to page descriptors. Nodes at the bottom level store pointers
to page descriptors (the leaves), while nodes at higher levels store
pointers to other nodes (the children). Each node is represented by
the radix_tree_node data structure,
which includes three fields: slots
is an array of 64 pointers, count
is a counter of how many pointers in the node are not NULL, and tags is a two-component array of flags that
will be discussed in the section "The Tags of the Radix
Tree" later in this chapter. The root of the tree is
represented by a radix_tree_root
data structure, having three fields: height denotes the current tree's height
(number of levels excluding the leaves), gfp_mask specifies the flags used when
requesting memory for a new node, and rnode points to the radix_tree_node data structure corresponding
to the node at level 1 of the tree (if any).
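A minimal C rendering of the two data structures just described may help; the layout is simplified (the real kernel versions carry additional locking and RCU machinery), and BITS_PER_LONG reflects the 32-bit architecture assumed throughout this section:

```c
#include <assert.h>
#include <stddef.h>

#define RADIX_TREE_MAP_SIZE 64          /* 2^6 slots per node */
#define BITS_PER_LONG       32          /* 32-bit architecture assumed */

struct radix_tree_node {
    unsigned int count;                 /* non-NULL pointers in this node */
    void *slots[RADIX_TREE_MAP_SIZE];   /* children, or page descriptors at
                                           the bottom level */
    /* two tag types (dirty, writeback), one bit per slot */
    unsigned long tags[2][RADIX_TREE_MAP_SIZE / BITS_PER_LONG];
};

struct radix_tree_root {
    unsigned int height;                /* levels, excluding the leaves */
    unsigned int gfp_mask;              /* flags for new-node allocation */
    struct radix_tree_node *rnode;      /* node at level 1, if any */
};
```

Each tags row thus spans 64 bits — two unsigned longs on a 32-bit machine — which is why radix_tree_tagged(), shown later, checks two words per tag.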
Let us consider a simple example. If none of the indices stored in the tree is greater than 63, the tree height is equal to one, because the 64 potential leaves can all be stored in the node at level 1 (see Figure 15-1 (a)). If, however, a new page descriptor corresponding to index 131 must be stored in the page cache, the tree height is increased to two, so that the radix tree can pinpoint indices up to 4095 (see Figure 15-1(b)).
Table 15-3 shows the highest page index and the corresponding maximum file size for each given height of the radix tree on a 32-bit architecture. In this case, the maximum height of a radix tree is six, although it is quite unlikely that the page cache of your system will make use of a radix tree that huge. Because the page index is stored in a 32-bit variable, when the tree has height equal to six, the node at the highest level can have at most four children.
Table 15-3. Highest index and maximum file size for each radix tree height
| Radix tree height | Highest index | Maximum file size |
|---|---|---|
| 0 | none | 0 bytes |
| 1 | 2^6 - 1 = 63 | 256 kilobytes |
| 2 | 2^12 - 1 = 4 095 | 16 megabytes |
| 3 | 2^18 - 1 = 262 143 | 1 gigabyte |
| 4 | 2^24 - 1 = 16 777 215 | 64 gigabytes |
| 5 | 2^30 - 1 = 1 073 741 823 | 4 terabytes |
| 6 | 2^32 - 1 = 4 294 967 295 | 16 terabytes |
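The figures in Table 15-3 follow directly from the six index bits consumed per level, with the whole index capped at 32 bits. A small helper reproduces them (a sketch of the arithmetic only — the kernel's real radix_tree_maxindex() reads the limits from a precomputed per-height table):

```c
#include <assert.h>

/* Highest page index representable by a radix tree of the given height
 * on a 32-bit architecture: 6 index bits per level, capped at 32 bits
 * total (hence height 6 tops out at 2^32 - 1, not 2^36 - 1). */
static unsigned long long radix_tree_maxindex(unsigned int height)
{
    if (height == 0)
        return 0;                        /* empty tree: no index ("none") */
    unsigned long long max = (1ULL << (6 * height)) - 1;
    if (max > 0xFFFFFFFFULL)
        max = 0xFFFFFFFFULL;             /* page index is a 32-bit value */
    return max;
}
```

The maximum file sizes in the table are then (highest index + 1) x 4 KB: for height 1, 64 pages of 4 KB give 256 KB; for height 6, 2^32 pages give 16 TB.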
The best way to understand how page lookup is performed is to recall how the paging system makes use of the page tables to translate linear addresses into physical addresses. As discussed in the section "Regular Paging" in Chapter 2, the 20 most significant bits of a linear address are split into two 10-bit long fields: the first field is an offset in the Page Directory, while the second one is an offset in the Page Table pointed to by the proper Page Directory entry.
A similar approach is used in the radix tree. The equivalent of
the linear address is the page's index. However, the number of fields
to be considered in the page's index depends on the height of the
radix tree. If the radix tree has height 1, only indices ranging from
0 to 63 can be represented, thus the 6 least significant bits of the
page's index are interpreted as the slots array index for the single node at
level 1. If the radix tree has height 2, the indices that can be
represented range from 0 to 4095; the 12 least significant bits of the
page's index are thus split into two fields of 6 bits each; the most
significant field is used as an array index for the node at level 1,
while the less significant field is used as an array index for the
node at level 2. The procedure is similar for every other radix tree
height. If the height is equal to 6, the 2 most significant bits of
the page's index are the array index for the node at level 1, the
following 6 bits are the array index for the node at level 2, and so
on up to the 6 least significant bits, which are the array index for
the node at level 6.
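The slicing of the page index into per-level slot numbers can be written down explicitly. For a tree of height 2 and index 131 (binary 10 000011), the level-1 slot is 2 and the level-2 slot is 3. This is a sketch of the arithmetic, not kernel code:

```c
#include <assert.h>

/* Slot index used at the given level (1 = topmost) when looking up
 * `index` in a radix tree of height `height`: extract the 6-bit field
 * whose weight corresponds to that level. */
static unsigned int radix_tree_slot(unsigned long index,
                                    unsigned int height,
                                    unsigned int level)
{
    unsigned int shift = 6 * (height - level);
    return (index >> shift) & 63;
}
```

At height 6 the level-1 field is only 2 bits wide (bits 30-31 of the index), which is why the topmost node can have at most four children, as noted above.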
If the highest index of a radix tree is smaller than the index of a page that should be added, then the kernel increases the tree height correspondingly; the intermediate nodes of the radix tree depend on the value of the page index (see Figure 15-1 for an example).
The basic high-level functions that use the page cache involve finding, adding, and removing a page. Another function based on the previous ones ensures that the cache includes an up-to-date version of a given page.
The find_get_page( )
function receives as its parameters a pointer to an address_space object and an offset value.
It acquires the address space's spin lock and invokes the radix_tree_lookup( ) function to search
for a leaf node of the radix tree having the required offset. This
function, in turn, starts from the root node of the tree and goes
down according to the bits of the offset value, as explained in the
previous section. If a NULL
pointer is encountered, the function returns NULL; otherwise, it returns the address of
a leaf node, that is, the pointer of the required page descriptor.
If the requested page is found, find_get_page( ) increases its usage
counter, releases the spin lock, and returns its address; otherwise,
the function releases the spin lock and returns NULL.
The find_get_pages( )
function is similar to find_get_page(
), but it performs a page cache lookup for a group of
pages having contiguous indices. It receives as its parameters a
pointer to an address_space
object, the offset in the address space from where to start
searching, the maximum number of pages to be retrieved, and a
pointer to an array of page descriptors to be filled by the
function. To perform the lookup operation, find_get_pages( ) relies on the radix_tree_gang_lookup( ) function, which
fills the array of pointers and returns the number of pages found.
The returned pages have ascending indices, although there may be
holes in the indices because some pages may not be in the page
cache.
There are several other functions that perform search
operations on the page cache. For example, the find_lock_page( ) function is similar to
find_get_page( ), but it
increases the usage counter of the returned page and invokes
lock_page( ) to set the PG_locked flag—thus, when the function
returns, the page can be accessed exclusively by the caller. The
lock_page( ) function, in turn,
blocks the current process if the page is already locked. To that
end, it invokes the _ _wait_on_bit_lock(
) function on the PG_locked bit. The latter function puts
the current process in the TASK_UNINTERRUPTIBLE state, stores the
process descriptor in a wait queue, executes the sync_page method of the address_space object to unplug the request
queue of the block device containing the file, and finally invokes
schedule( ) to suspend the
process until the PG_locked flag
of the page is cleared. To unlock a page and wake up the processes
sleeping in the wait queue, the kernel makes use of the unlock_page( ) function.
The find_trylock_page( )
function is similar to find_lock_page(
), except that it never blocks: if the requested page is
already locked, the function returns an error code. Finally, the
find_or_create_page( ) function
executes find_lock_page( );
however, if the page is not found, a new page is allocated and
inserted in the page cache.
The add_to_page_cache(
) function inserts a new page descriptor in the page
cache. It receives as its parameters the address page of the page descriptor, the address
mapping of an address_space object, the value offset representing the page index inside
the address space, and the memory allocation flags gfp_mask to be used when allocating the
new nodes of the radix tree. The function performs the following
operations:
Invokes radix_tree_preload(
), which disables kernel preemption and fills the
per-CPU variable radix_tree_preloads with a few free
radix_tree_node structures.
Allocation of radix_tree_node
structures is done by means of the radix_tree_node_cachep slab allocator
cache. If radix_tree_preload(
) fails in preallocating the radix_tree_node structures, the
add_to_page_cache( ) function
terminates by returning the error code -ENOMEM. Otherwise, if radix_tree_preload( ) succeeds,
add_to_page_cache( ) can be
sure that the insertion of the new page descriptor will not fail
for lack of free memory, at least for files of size up to 64
GB.
Acquires the mapping->tree_lock spin lock—notice
that kernel preemption has already been disabled by radix_tree_preload( ).
Invokes radix_tree_insert(
) to insert the new node in the tree. This function
performs the following steps:
Invokes radix_tree_maxindex(
) to get the maximum index that can be inserted in
the radix tree with its current height; if the index of the
new page cannot be represented with the current height, it
invokes radix_tree_extend(
) to increase the height of the tree by adding the
proper number of nodes (for instance, when applied to the
radix tree shown in Figure 15-1 (a),
radix_tree_extend( )
would add a single node on top of it). New nodes are
allocated by executing the radix_tree_node_alloc( ) function,
which tries to get a radix_tree_node structure from the
slab allocator cache or, if this allocation fails, from the
pool of preallocated structures stored in radix_tree_preloads.
Starting from the root (mapping->page_tree), it
traverses the tree according to the offset page's index until the leaf
is reached, as described in the previous section. If
required, it allocates new intermediate nodes by invoking
radix_tree_node_alloc(
).
Stores the page descriptor address in the proper slot of the last traversed node of the radix tree, and returns 0.
Increases the usage counter page->_count of the page
descriptor.
Because the page is new, its content is invalid: the
function sets the PG_locked
flag of the page frame to protect the page against concurrent
accesses from other kernel control paths.
Initializes page->mapping and page->index with the parameters
mapping and offset.
Increases the counter of cached pages in the address space
(mapping->nrpages).
Releases the address space's spin lock.
Invokes radix_tree_preload_end(
) to reenable kernel preemption.
Returns 0 (success).
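The descent described in the steps above — extend the tree if the index does not fit, then allocate intermediate nodes while consuming 6-bit fields of the index — can be condensed into a toy insert/lookup pair over a simplified node layout. This is a sketch under plain malloc, with none of the kernel's preloading, locking, or tag handling:

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    unsigned int count;                 /* non-NULL slots in this node */
    void *slots[64];
};

struct root {
    unsigned int height;                /* levels, excluding the leaves */
    struct node *rnode;
};

static unsigned long maxindex(unsigned int height)
{
    return height ? ((1UL << (6 * height)) - 1) : 0;
}

static struct node *node_alloc(void)
{
    return calloc(1, sizeof(struct node));
}

/* Counterpart of radix_tree_extend(): stack new nodes on top of the
 * tree until `index` becomes representable; the old tree becomes
 * child 0 of each new top node. */
static void extend(struct root *r, unsigned long index)
{
    while (index > maxindex(r->height)) {
        struct node *n = node_alloc();
        if (r->rnode) {
            n->slots[0] = r->rnode;
            n->count = 1;
        }
        r->rnode = n;
        r->height++;
    }
}

/* Counterpart of radix_tree_insert(): walk down from the root,
 * allocating intermediate nodes as needed, and store `page` in the
 * proper slot of the last node. */
static void insert(struct root *r, unsigned long index, void *page)
{
    extend(r, index);
    if (!r->rnode) {                    /* empty tree, height was 0 */
        r->rnode = node_alloc();
        r->height = 1;
    }
    struct node *n = r->rnode;
    for (unsigned int level = 1; level < r->height; level++) {
        unsigned int slot = (index >> (6 * (r->height - level))) & 63;
        if (!n->slots[slot]) {
            n->slots[slot] = node_alloc();
            n->count++;
        }
        n = n->slots[slot];
    }
    n->slots[index & 63] = page;
    n->count++;
}

/* Counterpart of radix_tree_lookup(): follow the 6-bit fields of the
 * index down to the leaf, returning NULL on any empty slot. */
static void *lookup(struct root *r, unsigned long index)
{
    if (!r->rnode || index > maxindex(r->height))
        return NULL;
    struct node *n = r->rnode;
    for (unsigned int level = 1; level <= r->height; level++) {
        unsigned int slot = (index >> (6 * (r->height - level))) & 63;
        n = n->slots[slot];
        if (!n)
            return NULL;
    }
    return n;
}
```

Inserting index 131 into a height-1 tree reproduces the transition of Figure 15-1: extend() adds one node on top (height becomes 2), and the old level-1 node becomes child 0 of the new root-level node.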
The remove_from_page_cache(
) function removes a page descriptor from the page cache.
This is achieved in the following way:
Acquires the page->mapping->tree_lock spin
lock and disables interrupts.
Invokes radix_tree_delete(
) to delete the node from the tree. This function
receives as its parameters the address of the tree's root
(page->mapping->page_tree) and
the index of the page to be removed and performs the following
steps:
Starting from the root, it traverses the tree
according to the page's index until the leaf is reached, as
described in the previous section. While doing so, it builds
up an array of radix_tree_path structures that
describe the components of the path from the root to the
leaf corresponding to the page to be deleted.
Starts a cycle on the nodes collected in the path
array, starting with the last node, which contains the
pointer to the page descriptor. For each node, it sets to
NULL the element of the
slots array pointing to the next node (or to the page
descriptor) and decreases the count field. If count becomes zero, it removes the
node from the tree and releases the radix_tree_node structure to the
slab allocator cache, then continues the cycle with the
preceding node in the path array; otherwise, if count does not become zero, it
continues with the next step.
Returns the pointer to the page descriptor that has been removed from the tree.
Sets the page->mapping field to NULL.
Decreases by one the page->mapping->nrpages counter
of cached pages.
Releases the page->mapping->tree_lock spin
lock, enables the interrupts, and terminates.
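The upward collapse performed by radix_tree_delete() — free a node as soon as its last slot goes NULL, then move to its parent — can be sketched in isolation. Here path[] plays the role of the radix_tree_path array collected on the way down, path[0] being the root-level node; the structures are toys, not the kernel's:

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    unsigned int count;                 /* non-NULL slots in this node */
    void *slots[64];
};

struct path_step {
    struct node *node;                  /* node visited at this level */
    unsigned int slot;                  /* slot followed toward the leaf */
};

/* Clear the leaf slot, then walk back toward the root, freeing every
 * node whose count drops to zero (the cycle of step 2 above). Returns
 * the number of nodes freed; if the level-1 node itself is freed, the
 * caller is expected to reset rnode. */
static int delete_and_collapse(struct path_step *path, int depth)
{
    int freed = 0;
    for (int i = depth - 1; i >= 0; i--) {
        struct node *n = path[i].node;
        n->slots[path[i].slot] = NULL;
        if (--n->count != 0)
            break;                      /* node still in use: stop here */
        free(n);                        /* emptied: release and continue */
        freed++;
    }
    return freed;
}
```

Note how the loop stops at the first ancestor that still holds other entries: only the chain of nodes made empty by this deletion is released.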
The read_cache_page(
) function ensures that the cache includes an up-to-date
version of a given page. Its parameters are a pointer mapping to an address_space object, an offset value
index that specifies the
requested page, a pointer filler
to a function that reads the page's data from disk (usually it is
the function that implements the address space's readpage method), and a pointer data that is passed to the filler function (usually, it is NULL). Here is a simplified description of
what the function does:
Invokes find_get_page(
) to check whether the page is already in the page
cache.
If the page is not in the page cache, it performs the following substeps:
Invokes alloc_pages(
) to allocate a new page frame.
Invokes add_to_page_cache(
) to insert the corresponding page descriptor into
the page cache.
Invokes lru_cache_add(
) to insert the page in the zone's inactive LRU
list (see the section "The Least Recently
Used (LRU) Lists" in Chapter 17).
Here the page is in the page cache. Invokes mark_page_accessed( ) to record the
fact that the page has been accessed (see the section "The Least Recently Used
(LRU) Lists" in Chapter 17).
If the page is not up-to-date (PG_uptodate flag clear), it invokes
the filler function to read
the page from disk.
Returns the address of the page descriptor.
As stated previously, the page cache not only allows the kernel to quickly retrieve a page containing specified data of a block device; the cache also allows the kernel to quickly retrieve pages in the cache that are in a given state.
For instance, let us suppose that the kernel must retrieve all
pages in the cache that belong to a given owner and that are dirty,
that is, the pages whose contents have not yet been written to disk.
The PG_dirty flag stored in the
page descriptor specifies whether a page is dirty or not; however,
traversing the whole radix tree to sequentially access all the
leaves—that is, the page descriptors—would be an unduly slow operation
if most pages are not dirty.
Instead, to allow a quick search of dirty pages, each
intermediate node in the radix tree contains a dirty tag for each
child node (or leaf); this flag is set if and only if at least one of
the dirty tags of the child node is set. The dirty tags of the nodes
at the bottom level are usually copies of the PG_dirty flags of the page descriptors. In
this way, when the kernel traverses a radix tree looking for dirty
pages, it can skip each subtree rooted at an intermediate node whose
dirty tag is clear: it knows for sure that all page descriptors stored
in the subtree are not dirty.
The same idea applies to the PG_writeback flag, which denotes that a page
is currently being written back to disk. Thus, each node of the radix
tree propagates two flags of the page descriptor: PG_dirty and PG_writeback (see the section "Page Descriptors" in
Chapter 8). To store them,
each node includes two arrays of 64 bits in the tags field. The tags[0] array (PAGECACHE_TAG_DIRTY) is the dirty tag, while
the tags[1] (PAGECACHE_TAG_WRITEBACK) array is the
writeback tag.
The radix_tree_tag_set( )
function is invoked when setting the PG_dirty or the PG_writeback flag of a cached page; it acts
on three parameters: the root of the radix tree, the page's index, and
the type of tag to be set (PAGECACHE_TAG_DIRTY or PAGECACHE_TAG_WRITEBACK). The function
starts from the root of the tree and goes down to the leaf
corresponding to the given index; for each node of the path leading
from the root to the leaf, the function sets the tag associated with
the pointer to the next node in the path. The function then returns
the address of the page descriptor. As a result, each node in the
path that goes down from the root to the leaf is tagged in the
appropriate way.
The radix_tree_tag_clear( )
function is invoked when clearing the PG_dirty or the PG_writeback flag of a cached page; it acts
on the same parameters as radix_tree_tag_set(
). The function starts from the root of the tree and goes
down to the leaf, building an array of radix_tree_path structures describing the
path. Then, the function proceeds backward from the leaf to the root:
it clears the tag of the node at the bottom level, then it checks
whether all tags in the node's array are now cleared; if so, the
function clears the proper tag in the parent node at the upper level,
checks whether all tags in that node are cleared, and so on. The
function then returns the address of the page descriptor.
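The two tag walks can be mimicked over a toy two-level structure in which each node carries a per-slot tag bitmap and a back-pointer to its parent. This is a sketch only: the real functions walk top-down using the radix_tree_path machinery, and the kernel packs the tags as the two-word bitmaps described earlier.

```c
#include <assert.h>
#include <stddef.h>

#define TAG_DIRTY     0
#define TAG_WRITEBACK 1

struct node {
    struct node *parent;        /* NULL for the node at level 1 */
    unsigned int parent_slot;   /* slot in the parent pointing here */
    unsigned long long tags[2]; /* one bit per slot, two tag types */
};

/* radix_tree_tag_set() analogue: tag the slot in the bottom-level node
 * and the slot leading to it in every ancestor, up to the root. */
static void tag_set(struct node *n, unsigned int slot, int tag)
{
    for (;;) {
        n->tags[tag] |= 1ULL << slot;
        if (!n->parent)
            break;
        slot = n->parent_slot;
        n = n->parent;
    }
}

/* radix_tree_tag_clear() analogue: clear bottom-up, stopping as soon as
 * a node still has some slot tagged. */
static void tag_clear(struct node *n, unsigned int slot, int tag)
{
    for (;;) {
        n->tags[tag] &= ~(1ULL << slot);
        if (n->tags[tag] || !n->parent)
            break;              /* other tagged slots remain: stop */
        slot = n->parent_slot;
        n = n->parent;
    }
}

/* radix_tree_tagged() analogue: only the root node's tags matter. */
static int tagged(struct node *root, int tag)
{
    return root->tags[tag] != 0;
}
```

Because set propagates up and clear stops at the first still-tagged ancestor, the invariant described above holds: a node's tag bit is set exactly when at least one tag is set somewhere in the corresponding subtree.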
When a page descriptor is removed from a radix tree, the proper
tags in the nodes belonging to the path from the root to the leaf must
be updated. The radix_tree_delete(
) function does this properly (even if we omitted mentioning
this fact in the previous section). The radix_tree_insert( ) function, however,
doesn't update the tags, because each page descriptor inserted in the
radix tree is supposed to have the PG_dirty and PG_writeback flags cleared. If necessary,
the kernel may later invoke the radix_tree_tag_set( ) function.
The radix_tree_tagged( )
function takes advantage of the arrays of flags included in all nodes
of the tree to test whether a radix tree includes at least one page in
a given state. The function performs this task quite simply by
executing the following code (root
is a pointer to the radix_tree_root
structure of the radix tree, and tag is the flag to be tested):
for (idx = 0; idx < 2; idx++) {
    if (root->rnode->tags[tag][idx])
        return 1;
}
return 0;
Because the tags of all nodes of the radix tree can be assumed
to be properly updated, radix_tree_tagged(
) needs only to check the tags of the node at level 1. An
example of the use of such a function occurs when determining whether an
inode contains dirty pages to be written to disk. Notice that in each
iteration the function tests whether any of the 32 flags stored in an
unsigned long is set.
The find_get_pages_tag( )
function is similar to find_get_pages(
) except that it returns only pages that are tagged with the
tag parameter. As we'll see in the
section "Writing Dirty
Pages to Disk," this function is crucial to quickly identify
all the dirty pages of an inode.
[*] Well, almost never: if you want to resume the whole state of the system after a shutdown, you can perform a "suspend to disk" operation (hibernation ), which saves the content of the whole RAM on a swap partition. We won't further discuss this case.
[*] An exception occurs for pages that have been swapped out. As
we will see in Chapter
17, these pages have a common address_space object not included in any
inode.
We have seen in the section "Block Devices Handling" in Chapter 14 that the VFS, the mapping layer, and the various filesystems group the disk data in logical units called "blocks."
In old versions of the Linux kernel, there were two different main disk caches: the page cache, which stored whole pages of disk data resulting from accesses to the contents of the disk files, and the buffer cache , which was used to keep in memory the contents of the blocks accessed by the VFS to manage the disk-based filesystems.
Starting from stable version 2.4.10, the buffer cache does not really exist anymore. In fact, for reasons of efficiency, block buffers are no longer allocated individually; instead, they are stored in dedicated pages called "buffer pages," which are kept in the page cache.
Formally, a buffer page is a page of data associated with additional descriptors called "buffer heads," whose main purpose is to quickly locate the disk address of each individual block in the page. In fact, the chunks of data stored in a page belonging to the page cache are not necessarily adjacent on disk.
Each block buffer has a buffer head
descriptor of type buffer_head.
This descriptor contains all the information needed by the kernel to
know how to handle the block; thus, before operating on each block,
the kernel checks its buffer head. The fields of a buffer head are
listed in Table
15-4.
Table 15-4. The fields of a buffer head
| Type | Field | Description |
|---|---|---|
| unsigned long | b_state | Buffer status flags |
| struct buffer_head * | b_this_page | Pointer to the next element in the buffer page's list |
| struct page * | b_page | Pointer to the descriptor of the buffer page holding this block |
| atomic_t | b_count | Block usage counter |
| u32 | b_size | Block size |
| sector_t | b_blocknr | Block number relative to the block device (logical block number) |
| char * | b_data | Position of the block inside the buffer page |
| struct block_device * | b_bdev | Pointer to block device descriptor |
| bh_end_io_t * | b_end_io | I/O completion method |
| void * | b_private | Pointer to data for the I/O completion method |
| struct list_head | b_assoc_buffers | Pointers for the list of indirect blocks associated with an inode (see the section "The address_space Object" earlier in this chapter) |
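The field list above maps onto the buffer_head declaration in <linux/buffer_head.h>. As a rough user-space sketch — field names match the kernel's, while kernel-only types such as atomic_t, sector_t, and the page and block_device structures are replaced by plain C stand-ins — the layout looks like this:

```c
#include <assert.h>
#include <stdint.h>

/* User-space sketch of the 2.6 buffer_head layout.  Field names match
 * the kernel's; the types are plain-C stand-ins for kernel-only types
 * (atomic_t, sector_t, struct page, struct block_device, list_head). */
struct page;            /* opaque here */
struct block_device;    /* opaque here */
struct buffer_head;
typedef void bh_end_io_t(struct buffer_head *bh, int uptodate);

struct buffer_head {
    unsigned long        b_state;       /* buffer status flags */
    struct buffer_head  *b_this_page;   /* next buffer in the page's circular list */
    struct page         *b_page;        /* buffer page holding this block */
    int                  b_count;       /* usage counter (atomic_t in the kernel) */
    uint32_t             b_size;        /* block size */
    uint64_t             b_blocknr;     /* logical block number (sector_t) */
    char                *b_data;        /* position inside the buffer page */
    struct block_device *b_bdev;        /* block device descriptor */
    bh_end_io_t         *b_end_io;      /* I/O completion method */
    void                *b_private;     /* data for the completion method */
    /* b_assoc_buffers (a list_head for indirect blocks) omitted here */
};
```

Note that a lone buffer is its own one-element ring: its b_this_page link points back to itself.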
Two fields of the buffer head encode the disk address of the
block: the b_bdev field identifies
the block device—usually, a disk or a partition—that contains the
block (see the section "Block Devices" in Chapter 14), while the b_blocknr field stores the logical
block number, that is, the index of the block inside its
disk or partition.
The b_data field specifies
the position of the block buffer inside the buffer page. Actually, the
encoding of this position depends on whether the page is in high
memory or not. If the page is in high memory, the b_data field contains the offset of the
block buffer with respect to the beginning of the page; otherwise,
b_data contains the linear address
of the block buffer.
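A convenient consequence of this dual encoding is that the block's offset within its page can be extracted with a single mask in both cases, which is what the kernel's bh_offset() macro exploits. A user-space sketch, assuming 4 KB pages:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Mirrors the kernel's bh_offset() macro.  For a low-memory page,
 * b_data is a linear address, and masking off the page-aligned bits
 * leaves the offset inside the page; for a high-memory page, b_data
 * already holds an offset smaller than PAGE_SIZE, which the mask
 * leaves untouched.  Either way the result is the block's offset. */
static unsigned long bh_offset(unsigned long b_data)
{
    return b_data & ~PAGE_MASK;
}
```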
The b_state field may store
several flags. Some of them are of general use and are listed in Table 15-5. Each
filesystem may also define its own private buffer head flags.
Table 15-5. The buffer head's general flags
The buffer heads have their own slab allocator cache,
whose kmem_cache_s descriptor is
stored in the bh_cachep variable.
The alloc_buffer_head( ) and
free_buffer_head( ) functions are
used to get and release a buffer head, respectively.
The b_count field of the
buffer head is a usage counter for the corresponding block buffer. The
counter is increased right before each operation on the block buffer
and decreased right after. The block buffers kept in the page cache
are examined both periodically and when free memory becomes scarce,
and only the block buffers having null usage counters may be reclaimed
(see Chapter 17).
When a kernel control path wishes to access a block buffer, it
should first increase the usage counter. The function that locates a
block inside the page cache (_ _getblk(
); see the section "Searching Blocks in the Page
Cache" later in this chapter) does this automatically, hence
the higher-level functions do not usually increase the block buffer's
usage counter.
When a kernel control path stops accessing a block buffer, it
should invoke either _ _brelse( )
or _ _bforget( ) to decrease the
corresponding usage counter. The difference between these two
functions is that _ _bforget( )
also removes the block from any list of indirect blocks (b_assoc_buffers buffer head field; see the
previous section "Block
Buffers and Buffer Heads") and marks the buffer as clean, thus
forcing the kernel to forget any change in the buffer that has yet to
be written on disk.
Whenever the kernel must individually address a block, it refers to the buffer page that holds the block buffer and checks the corresponding buffer head.
Here are two common cases in which the kernel creates buffer pages:
When reading or writing pages of a file that are not stored in contiguous disk blocks. This happens either because the filesystem has allocated noncontiguous blocks to the file, or because the file contains "holes" (see the section "File Holes" in Chapter 18).
When accessing a single disk block (for instance, when reading a superblock or an inode block).
In the first case, the buffer page's descriptor is inserted in the radix tree of a regular file. The buffer heads are preserved because they store precious information: the block device and the logical block number that specify the position of the data in the disk. We will see how the kernel makes use of this type of buffer page in Chapter 16.
In the second case, the buffer page's descriptor is inserted in
the radix tree rooted at the address_space object of the inode in the
bdev special filesystem associated with the block device
(see the section "The
address_space Object" earlier in this chapter). This kind of
buffer pages must satisfy a strong constraint: all the block buffers
must refer to adjacent blocks of the underlying block device.
An instance of where this is useful is when the VFS wants to read the 1,024-byte inode block containing the inode of a given file. Instead of allocating a single buffer, the kernel must allocate a whole page storing four buffers; these buffers will contain the data of a group of four adjacent blocks on the block device, including the requested inode block.
In this chapter we will focus our attention on the second type of buffer pages, the so-called block device buffer pages (sometimes shortened to blockdev's pages).
All the block buffers within a single buffer page must have the same size; hence, on the 80x86 architecture, a buffer page can include from one to eight buffers, depending on the block size.
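The count follows directly from dividing the page size by the block size; a trivial sketch, assuming the usual 4 KB page of the 80x86:

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/* Number of block buffers that fit in one buffer page.  Valid block
 * sizes are powers of two between 512 bytes and PAGE_SIZE, so the
 * result ranges from eight (512-byte blocks) down to one. */
static unsigned int buffers_per_page(unsigned int block_size)
{
    return PAGE_SIZE / block_size;
}
```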
When a page acts as a buffer page, all buffer heads associated
with its block buffers are collected in a singly linked circular list.
The private field of the descriptor
of the buffer page points to the buffer head of the first block in the
page;[*] every buffer head stores in the b_this_page field a pointer to the next
buffer head in the list. Moreover, every buffer head stores the
address of the buffer page's descriptor in the b_page field. Figure 15-2 shows a buffer
page containing four block buffers and the corresponding buffer
heads.
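The ring of Figure 15-2 is traversed with the usual do/while idiom, starting from the head stored in the page's private field and following b_this_page until the walk wraps around. A user-space simulation with a stripped-down buffer head:

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down buffer head: only the ring link matters here. */
struct buffer_head {
    struct buffer_head *b_this_page;  /* next buffer head in the ring */
};

/* Link an array of buffer heads into a singly linked circular list,
 * as alloc_page_buffers() does for the buffers of one page. */
static void link_ring(struct buffer_head *bhs, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        bhs[i].b_this_page = &bhs[(i + 1) % n];
}

/* Visit every buffer head of the page with the do/while idiom,
 * stopping when the walk wraps back to the head. */
static unsigned int count_buffers(struct buffer_head *head)
{
    struct buffer_head *bh = head;
    unsigned int n = 0;
    do {
        n++;
        bh = bh->b_this_page;
    } while (bh != head);
    return n;
}
```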
The kernel allocates a new block device buffer page when it discovers that the page cache does not include a page containing the buffer for a given block (see the section "Searching Blocks in the Page Cache" later in this chapter). In particular, the lookup operation for the block might fail for the following reasons:
The radix tree of the block device does not include a page containing the data of the block: in this case a new page descriptor must be added to the radix tree.
The radix tree of the block device includes a page containing the data of the block, but this page is not a buffer page: in this case new buffer heads must be allocated and linked to the page, thus transforming it into a block device buffer page.
The radix tree of the block device includes a buffer page containing the data of the block, but the page has been split in blocks of size different from the size of the requested block: in this case the old buffer heads must be released, and a new set of buffer heads must be allocated and linked to the page.
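These three outcomes can be condensed into a small decision helper; the function and enum below are a hypothetical user-space condensation of the checks just described, not kernel code:

```c
#include <assert.h>

/* Hypothetical condensation of the lookup outcomes: given what the
 * page-cache lookup found, decide what work is needed.  The enum
 * names are invented for illustration. */
enum lookup_result {
    NEED_NEW_PAGE,      /* no page in the radix tree: insert one */
    NEED_BUFFER_HEADS,  /* page exists but is not yet a buffer page */
    NEED_REBUILD,       /* buffer heads exist but with the wrong block size */
    PAGE_OK             /* a valid buffer page was found */
};

static enum lookup_result classify(int page_present, int has_buffers,
                                   unsigned int have_size,
                                   unsigned int want_size)
{
    if (!page_present)
        return NEED_NEW_PAGE;
    if (!has_buffers)
        return NEED_BUFFER_HEADS;
    if (have_size != want_size)
        return NEED_REBUILD;
    return PAGE_OK;
}
```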
In order to add a block device buffer page to the page cache,
the kernel invokes the grow_buffers(
) function, which receives three parameters that identify
the block:
The address bdev of the
block_device descriptor
The logical block number block — the position of the block inside
the block device
The block size size
The function essentially performs the following actions:
1. Computes the offset index of the page of data within the block device that includes the requested block.
2. Invokes grow_dev_page( ) to create a new block device buffer page, if necessary. In turn, this function performs the following substeps:

a. Invokes find_or_create_page( ), passing to it the address_space object of the block device (bdev->bd_inode->i_mapping), the page offset index, and the GFP_NOFS flag. As described in the earlier section "Page Cache Handling Functions," find_or_create_page( ) looks for the page in the page cache and, if necessary, inserts a new page in the cache.

b. Now the required page is in the page cache, and the function has the address of its descriptor. The function checks its PG_private flag; if it is NULL, the page is not yet a buffer page (it has no associated buffer heads): it jumps to step 2e.

c. The page is already a buffer page. Gets from the private field of its descriptor the address bh of the first buffer head, and checks whether the block size bh->size is equal to the size of the requested block; if so, the page found in the page cache is a valid buffer page: it jumps to step 2g.

d. The page has blocks of the wrong size: it invokes try_to_free_buffers( ) (see the next section) to release the previous buffer heads of the buffer page.

e. Invokes the alloc_page_buffers( ) function to allocate the buffer heads for the blocks of the requested size within the page and insert them into the singly linked circular list implemented by the b_this_page fields. Moreover, the function initializes the b_page fields of the buffer heads with the address of the page descriptor, and the b_data fields with the offset or linear address of the block buffer inside the page.

f. Stores the address of the first buffer head in the private field, sets the PG_private field, and increases the usage counter of the page (the block buffers inside the page count as a page user).

g. Invokes the init_page_buffers( ) function to initialize the b_bdev, b_blocknr, and b_state fields of the buffer heads linked to the page. All blocks are adjacent on disk, hence the logical block numbers are consecutive and can be easily derived from block.

h. Returns the page descriptor address.
3. Unlocks the page (the page was locked by find_or_create_page( )).
4. Decreases the page's usage counter (again, the counter was increased by find_or_create_page( )).
5. Returns 1 (success).
As we will see in Chapter 17, block device buffer
pages are released when the kernel tries to get additional free
memory. Clearly a buffer page cannot be freed if it contains dirty or
locked buffers. To release buffer pages, the kernel invokes the
try_to_release_page( ) function,
which receives the address page of
a page descriptor and performs the following actions:[*]
If the PG_writeback flag
of the page is set, it returns 0 (no release is possible because
the page is being written back to disk).
If defined, it invokes the releasepage method of the block device's
address_space object. (The
method is usually not defined for block devices.)
Invokes the try_to_free_buffers(
) function, and returns its error code.
In turn, the try_to_free_buffers(
) function scans the buffer heads linked to the buffer page;
it performs essentially the following actions:
Checks the flags of all the buffer heads of buffers included
in the page. If some buffer head has the BH_Dirty or BH_Locked flag set, the function
terminates by returning 0 (failure): it is not possible to release
the buffers.
If a buffer head is inserted in a list of indirect buffers (see the section "Block Buffers and Buffer Heads" earlier in this chapter), the function removes it from the list.
Clears the PG_private
flag of the page descriptor, sets the private field to NULL, and decreases the page's usage
counter.
Clears the PG_dirty flag
of the page.
Invokes repeatedly free_buffer_head( ) on the buffer heads
of the page to free all of them.
Returns 1 (success).
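The first action — refusing to release a page containing dirty or locked buffers — amounts to a scan of the page's ring of buffer heads. In the user-space sketch below the flag bit positions are illustrative, not the kernel's actual BH_* values:

```c
#include <assert.h>

/* Illustrative flag bits -- NOT the kernel's actual BH_* values. */
#define BHF_DIRTY  (1UL << 0)
#define BHF_LOCKED (1UL << 1)

struct buffer_head {
    struct buffer_head *b_this_page;  /* ring of buffers in the page */
    unsigned long b_state;            /* buffer status flags */
};

/* First check of try_to_free_buffers(): returns 1 if every buffer in
 * the page's ring is clean and unlocked (the page may be released),
 * 0 otherwise. */
static int buffers_freeable(struct buffer_head *head)
{
    struct buffer_head *bh = head;
    do {
        if (bh->b_state & (BHF_DIRTY | BHF_LOCKED))
            return 0;
        bh = bh->b_this_page;
    } while (bh != head);
    return 1;
}
```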
When the kernel needs to read or write a single block of a physical device (for instance, a superblock), it must check whether the required block buffer is already included in the page cache. Searching the page cache for a given block buffer—specified by the address bdev of a block device descriptor and by a logical block number nr—is a three-stage process:
Get a pointer to the address_space object of the block device
containing the block (bdev->bd_inode->i_mapping).
Get the block size of the device (bdev->bd_block_size), and compute the
index of the page that contains the block. This is always a bit
shift operation on the logical block number. For instance, if the
block size is 1,024 bytes, each buffer page contains four block
buffers, thus the page's index is nr/4.
Searches for the buffer page in the radix tree of the block device. After obtaining the page descriptor, the kernel has access to the buffer heads that describe the status of the block buffers inside the page.
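The index computation in the second stage is a pure bit shift; a sketch assuming 4 KB pages (PAGE_SHIFT = 12):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KB pages assumed */

/* Index of the buffer page containing logical block nr on a device
 * whose block size is (1 << blkbits); mirrors the kernel expression
 * block >> (PAGE_SHIFT - bdev->bd_inode->i_blkbits). */
static uint64_t page_index(uint64_t nr, unsigned int blkbits)
{
    return nr >> (PAGE_SHIFT - blkbits);
}
```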
Details are slightly more complicated than this, however. In
order to enhance system performance, the kernel manages a bh_lrus array of small disk caches, one for each CPU, called the Least Recently
Used (LRU) block cache. Each disk cache contains eight
pointers to buffer heads that have been recently accessed by a given
CPU. The elements in each CPU array are sorted so that the pointer to
the most recently used buffer head has index 0. The same buffer head
might appear on several CPU arrays (but never twice in the same CPU
array); for each occurrence of a buffer head in the LRU block
cache, the buffer head's b_count usage counter is increased by
one.
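The reshuffling behaves like a small move-to-front array. The helper below is a hypothetical user-space simulation of that policy, with integers standing in for buffer-head pointers; it returns whatever entry falls off the end so the caller could drop a reference, mirroring the b_count bookkeeping:

```c
#include <assert.h>
#include <string.h>

#define BH_LRU_SIZE 8

/* Hypothetical simulation of the per-CPU LRU block cache: a
 * move-to-front array of BH_LRU_SIZE entries (empty slots hold -1).
 * On a hit, the entry is rotated to index 0.  On a miss, every entry
 * shifts down one slot, the new key lands at index 0, and the
 * function returns whatever fell off the end, or -1 if nothing
 * was dropped. */
static int lru_touch(int lru[BH_LRU_SIZE], int key)
{
    int i, evicted;

    for (i = 0; i < BH_LRU_SIZE; i++) {
        if (lru[i] == key) {            /* hit: rotate [0..i] right */
            memmove(&lru[1], &lru[0], i * sizeof(int));
            lru[0] = key;
            return -1;
        }
    }
    evicted = lru[BH_LRU_SIZE - 1];     /* miss: shift all down */
    memmove(&lru[1], &lru[0], (BH_LRU_SIZE - 1) * sizeof(int));
    lru[0] = key;
    return evicted;
}
```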
The _ _find_get_block(
) function receives as its parameters the address bdev of a block_device descriptor, the block number
block, and the block size
size, and returns the address of
the buffer head associated with the block buffer inside the page
cache, or NULL if no such block
buffer exists. The function performs essentially the following
actions:
1. Checks whether the LRU block cache array of the executing CPU includes a buffer head whose b_bdev, b_blocknr, and b_size fields are equal to bdev, block, and size, respectively.
2. If the buffer head is in the LRU block cache, it reshuffles the elements in the array so as to put the pointer to the just discovered buffer head in the first position (index 0), increases its b_count field, and jumps to step 8.
3. Here the buffer head is not in the LRU block cache: it derives from the block number and the block size the page index relative to the block device as:

    index = block >> (PAGE_SHIFT - bdev->bd_inode->i_blkbits);
4. Invokes find_get_page( ), passing as parameters a pointer to the address_space object of the block device (bdev->bd_inode->i_mapping) and the page index, to locate in the page cache the descriptor of the buffer page containing the required block buffer. If there is no such page in the cache, it returns NULL (failure).
5. At this point, the function has the address of a descriptor for the buffer page: it scans the list of buffer heads linked to the buffer page, looking for the block having logical block number equal to block.
6. Decreases the count field of the page descriptor (it was increased by find_get_page( )).
7. Moves all elements in the LRU block cache one position down, and inserts the pointer to the buffer head of the requested block in the first position. If a buffer head has been dropped out of the LRU block cache, it decreases its b_count usage counter.
8. Invokes mark_page_accessed( ) to move the buffer page in the proper LRU list, if necessary (see the section "The Least Recently Used (LRU) Lists" in Chapter 17).
9. Returns the buffer head pointer.
The _ _getblk( )
function receives the same parameters as _
_find_get_block( ), namely the address bdev of a block_device descriptor, the block number
block, and the block size
size, and returns the address of
a buffer head associated with the buffer. The function never fails:
even if the block does not exist at all, the _ _getblk( ) obligingly allocates a block
device buffer page and returns a pointer to the buffer head that
should describe the block. Notice that the block buffer returned by
_ _getblk( ) does not necessarily
contain valid data—the BH_Uptodate flag of the buffer head might
be cleared.
The _ _getblk( ) function
essentially performs the following steps:
1. Invokes _ _find_get_block( ) to check whether the block is already in the page cache. If the block is found, the function returns the address of its buffer head.
2. Otherwise, it invokes grow_buffers( ) to allocate a new buffer page for the requested block (see the section "Allocating Block Device Buffer Pages" earlier in this chapter).
3. If grow_buffers( ) fails in allocating such a page, _ _getblk( ) tries to reclaim some memory by invoking free_more_memory( ) (see Chapter 17).
4. Jumps back to step 1.
The _ _bread( ) function
receives the same parameters as _ _getblk(
), namely the address bdev of a block_device descriptor, the block number
block, and the block size
size, and returns the address of
a buffer head associated with the buffer. Contrary to _ _getblk( ), the function reads the block
from disk, if necessary, before returning the buffer head. The
_ _bread( ) function performs the
following steps:
Invokes _ _getblk( ) to
find in the page cache the buffer page associated with the
required block and to get a pointer to the corresponding buffer
head.
If the block is already in the page cache and the buffer
contains valid data (flag BH_Uptodate set), it returns the
address of the buffer head.
Otherwise, it increases the usage counter of the buffer head.
Sets the b_end_io field
to the address of end_buffer_read_sync(
) (see the next section).
Invokes submit_bh( ) to
transmit the buffer head to the generic block layer (see next
section).
Invokes wait_on_buffer(
) to put the current process in a wait queue until the
read I/O operation is completed, that is, until the BH_Lock flag of the buffer head is
cleared.
Returns the address of the buffer head.
A couple of functions, submit_bh( ) and ll_rw_block( ), allow the kernel to start an
I/O data transfer on one or more buffers described by their buffer
heads.
To transmit a single buffer head to the generic block layer,
and thus to require the transfer of a single block of data, the
kernel makes use of the submit_bh(
) function. Its parameters are the direction of data
transfer (essentially READ or
WRITE) and a pointer bh to the buffer head describing the block
buffer.
The submit_bh( ) function
assumes that the buffer head is fully initialized; in particular,
the b_bdev, b_blocknr, and b_size fields must be properly set to
identify the block on disk containing the requested data. If the
block buffer belongs to a block device buffer page, the
initialization of the buffer head is done by _ _find_get_block( ), as described in the
previous section. However, as we will see in the next chapter,
submit_bh( ) can also be invoked
on blocks belonging to buffer pages owned by regular files.
The submit_bh( ) function
is little more than a glue function that creates a bio request from
the contents of the buffer head and then invokes generic_make_request( ) (see the section
"Submitting a
Request" in Chapter
14). The main steps performed by it are the following:
Sets the BH_Req flag of
the buffer head to record that the block has been submitted at
least one time; moreover, if the direction of the data transfer
is WRITE, clears the BH_Write_EIO flag.
Invokes bio_alloc( ) to
allocate a new bio descriptor
(see the section "The Bio Structure"
in Chapter
14).
Initializes the fields of the bio descriptor according to the
contents of the buffer head:
Sets the bi_sector
field to the number of the first sector in the block
(bh->b_blocknr * bh->b_size /
512);
Sets the bi_bdev
field with the address of the block device descriptor
(bh->b_bdev);
Sets the bi_size
field with the block size (bh->b_size);
Initializes the first element of the bi_io_vec array so that the
segment corresponds to the block buffer: bi_io_vec[0].bv_page is set to
bh->b_page, bi_io_vec[0].bv_len is set to
bh->b_size, and
bi_io_vec[0].bv_offset is
set to the offset of the block buffer in the page as
specified by bh->b_data;
Sets bi_vcnt to 1
(just one segment on the bio), and bi_idx to 0 (the current segment
to be transferred);
Sets the bi_end_io
field to the address of end_bio_bh_io_sync( ), and sets
the bi_private field to
the address of the buffer head; the function will be invoked
when the data transfer terminates (see below).
Increases the reference counter of the bio (it becomes equal to 2).
Invokes submit_bio( ),
which sets the bi_rw flag
with the direction of the data transfer, updates the page_states per-CPU variable to keep
track of the number of sectors read and written, and invokes the
generic_make_request( )
function on the bio
descriptor.
Decreases the usage counter of the bio; the bio descriptor is not freed, because it is now inserted in a queue of the I/O scheduler.
Returns 0 (success).
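The only arithmetic in the bio initialization above is the sector conversion (bi_sector = bh->b_blocknr * bh->b_size / 512); a sketch of that formula, assuming 512-byte sectors:

```c
#include <assert.h>
#include <stdint.h>

/* First 512-byte sector of a block, as computed when submit_bh()
 * fills in bi_sector: bh->b_blocknr * bh->b_size / 512.  Block sizes
 * are multiples of the sector size, so the division is exact. */
static uint64_t block_to_sector(uint64_t b_blocknr, uint32_t b_size)
{
    return b_blocknr * b_size / 512;
}
```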
When the I/O data transfer on the bio terminates, the kernel
executes the bi_end_io method, in
this particular case the end_bio_bh_io_sync( ) function. The latter
function essentially gets the address of the buffer head from the
bi_private field of the bio, then
invokes the b_end_io method of
the buffer head—it was properly set before invoking submit_bh( )—and finally invokes bio_put( ) to destroy the bio structure.
Sometimes the kernel must trigger the data transfer of
several data blocks at once, which are not necessarily physically
adjacent. The ll_rw_block( )
function receives as its parameters the direction of data transfer
(essentially READ or WRITE), the number of blocks to be
transferred, and an array of pointers to buffer heads describing the
corresponding block buffers. The function iterates over all buffer
heads; for each of them, it executes the following actions:
1. Tests and sets the BH_Lock flag of the buffer head; if the buffer was already locked, the data transfer has been activated by another kernel control path, so it skips the buffer by jumping to step 9.
2. Increases by one the usage counter b_count of the buffer head.
3. If the data transfer direction is WRITE, it sets the b_end_io method of the buffer head to point to the address of the end_buffer_write_sync( ) function; otherwise, it sets the b_end_io method to point to the address of the end_buffer_read_sync( ) function.
4. If the data transfer direction is WRITE, it tests and clears the BH_Dirty flag of the buffer head. If the flag was not set, there is no need to write the block on disk, so it jumps to step 7.
5. If the data transfer direction is READ or READA (read-ahead), it checks whether the BH_Uptodate flag of the buffer head is set; if so, there is no need to read the block from disk, so it jumps to step 7.
6. Here the block has to be read or written: it invokes the submit_bh( ) function to pass the buffer head to the generic block layer, then jumps to step 9.
7. Unlocks the buffer head by clearing the BH_Lock flag, and awakens every process that was waiting for the block to be unlocked.
8. Decreases the b_count field of the buffer head.
9. If there are other buffer heads in the array to be processed, it selects the next one and jumps back to step 1; otherwise, it terminates.
Notice that if the ll_rw_block(
) function passes a buffer head to the generic block
layer, it leaves the buffer locked and its reference counter
increased, so that the buffer cannot be accessed and cannot be freed
until the data transfer completes. The kernel executes the b_end_io completion method of the buffer
head when the data transfer for the block terminates. Assuming that
there was no I/O error, the end_buffer_write_sync( ) and end_buffer_read_sync( ) functions simply
set the BH_Uptodate flag of the
buffer head, unlock the buffer, and decrease its usage
counter.
[*] Because the private field
contains valid data, the PG_private flag of the page is also set;
hence, if the page contains disk data and the PG_private flag is set, then the page is
a buffer page. Notice, however, that other kernel components not
related to the block I/O subsystem use the private and PG_private fields for other
purposes.
As we have seen, the kernel keeps filling the page cache
with pages containing data of block devices. Whenever a process modifies
some data, the corresponding page is marked as dirty—that is, its
PG_dirty flag is set.
Unix systems allow the deferred writes of dirty pages into block devices, because this noticeably improves system performance. Several write operations on a page in cache could be satisfied by just one slow physical update of the corresponding disk sectors. Moreover, write operations are less critical than read operations, because a process is usually not suspended due to delayed writes, while it is most often suspended because of delayed reads. Thanks to deferred writes, each physical block device will service, on the average, many more read requests than write ones.
A dirty page might stay in main memory until the last possible moment — that is, until system shutdown. However, pushing the delayed-write strategy to its limits has two major drawbacks:
If a hardware or power supply failure occurs, the contents of RAM can no longer be retrieved, so many file updates that were made since the system was booted are lost.
The size of the page cache, and hence of the RAM required to contain it, would have to be huge—at least as big as the size of the accessed block devices.
Therefore, dirty pages are flushed (written) to disk under the following conditions:
The page cache gets too full and more pages are needed, or the number of dirty pages becomes too large.
Too much time has elapsed since a page has stayed dirty.
A process requests all pending changes of a block device or of
a particular file to be flushed; it does this by invoking a sync( ), fsync(
), or fdatasync( )
system call (see the section "The sync( ), fsync( ), and
fdatasync( ) System Calls" later in this chapter).
Buffer pages introduce a further complication. The buffer heads
associated with each buffer page allow the kernel to keep track of the
status of each individual block buffer. The PG_dirty flag of the buffer page should be set
if at least one of the associated buffer heads has the BH_Dirty flag set. When the kernel selects a
dirty buffer page for flushing, it scans the associated buffer heads and
effectively writes to disk only the contents of the dirty blocks. As
soon as the kernel flushes all dirty blocks in a buffer page to disk, it
clears the PG_dirty flag of the
page.
Earlier versions of Linux used a kernel thread called bdflush to systematically scan the page cache looking for dirty pages to flush, and they used a second kernel thread called kupdate to ensure that no page remains dirty for too long. Linux 2.6 has replaced both of them with a group of general purpose kernel threads called pdflush.
These kernel threads have a flexible structure. They act on two parameters: a pointer to a function to be executed by the thread and an argument for the function. The number of pdflush kernel threads in the system is dynamically adjusted: new threads are created when there are too few and existing threads are killed when there are too many. Because the functions executed by these kernel threads can block, creating several pdflush kernel threads instead of a single one leads to better system performance.
Births and deaths are governed by the following rules:
There must be at least two pdflush kernel threads and at most eight.
If there were no idle pdflush during the last second, a new pdflush should be created.
If more than one second elapsed since the last pdflush became idle, a pdflush should be removed.
Each pdflush kernel thread has a pdflush_work descriptor (see Table 15-6). The
descriptors of idle pdflush kernel threads are
collected in the pdflush_list list;
the pdflush_lock spin lock protects
that list from concurrent accesses in multiprocessor systems. The
nr_pdflush_threads
variable[*] stores the total number of pdflush
kernel threads (idle and busy). Finally, the last_empty_jifs variable stores the last
time (in jiffies) since the pdflush_list list of
pdflush threads became empty.
Table 15-6. The fields of the pdflush_work descriptor
| Type | Field | Description |
|---|---|---|
| struct task_struct * | who | Pointer to kernel thread descriptor |
| void (*)(unsigned long) | fn | Callback function to be executed by the kernel thread |
| unsigned long | arg0 | Argument to callback function |
| struct list_head | list | Links for the pdflush_list list of idle pdflush descriptors |
| unsigned long | when_i_went_to_sleep | Time in jiffies when kernel thread became available |
Each pdflush kernel thread executes the
_ _pdflush( ) function, which
essentially loops in an endless cycle until the kernel thread dies.
Let's suppose that the pdflush kernel thread is
idle; then, the process is sleeping in TASK_INTERRUPTIBLE state. As soon as the
kernel thread is woken up, _ _pdflush(
) accesses its pdflush_work descriptor and executes the
callback function stored in the fn
field, passing to it the argument stored in the arg0 field. When the callback function
terminates, _ _pdflush( ) checks
the value of the last_empty_jifs
variable: if there was no idle pdflush kernel
thread for more than one second and if there are less than eight
pdflush kernel threads, _ _pdflush( ) starts another kernel thread.
Otherwise, if the last entry in the pdflush_list list is idle for more than one
second, and there are more than two pdflush
kernel threads, _ _pdflush( )
terminates: as explained in the section "Kernel Threads" in Chapter 3, the corresponding
kernel thread executes the _exit( )
system call and it is thus destroyed. Otherwise,
_ _pdflush( ) reinserts the
pdflush_work descriptor of the
kernel thread in the pdflush_list
list and puts the kernel thread to sleep.
The pdflush_operation( )
function is used to activate an idle pdflush
kernel thread. This function acts on two parameters: a pointer
fn to the function that must be
executed and an argument arg0; it
performs the following steps:
Extracts from the pdflush_list list a pointer pdf to the pdflush_work descriptor of an idle
pdflush kernel thread. If the list is empty,
it returns -1. If the list
contained just one element, it sets the value of the last_empty_jifs variable to jiffies.
Stores in pdf->fn and
in pdf->arg0 the parameters
fn and arg0.
Invokes wake_up_process(
) to wake up the idle pdflush
kernel thread, that is, pdf->who.
What kinds of jobs are delegated to the pdflush kernel threads? There are a few of them, all related to flushing of dirty data. In particular, pdflush usually executes one of the following callback functions:
background_writeout( ):
systematically walks the page cache looking for dirty pages to be
flushed (see the next section "Looking for Dirty Pages To
Be Flushed").
wb_kupdate( ): checks
that no page in the page cache remains dirty for too long (see the
section "Retrieving
Old Dirty Pages" later in this chapter).
Every radix tree could include dirty pages to be
flushed. Retrieving all of them thus involves an exhaustive search
among all address_space objects
associated with inodes having an image on disk. Because the page cache
might include a large number of pages, scanning the whole cache in a
single run might keep the CPU and the disks busy for a long time.
Therefore, Linux adopts a sophisticated mechanism that splits the page
cache scanning in several runs of execution.
The wakeup_bdflush( )
function receives as argument the number of dirty pages in the page
cache that should be flushed; the value zero means that all dirty
pages in the cache should be written back to disk. The function
invokes pdflush_operation( ) to
wake up a pdflush kernel thread (see the previous
section) and delegate to it the execution of the background_writeout( ) callback function.
The latter function effectively retrieves the specified number of
dirty pages from the page cache and writes them back to disk.
The wakeup_bdflush( )
function is executed when either memory is scarce or a user makes an
explicit request for a flush operation. In particular, the function is
invoked when:
A User Mode process issues a sync( ) system call (see the section "The sync( ), fsync( ), and
fdatasync( ) System Calls" later in this chapter).
The grow_buffers( )
function fails to allocate a new buffer page (see the earlier
section "Allocating
Block Device Buffer Pages").
The page frame reclaiming algorithm invokes free_more_memory( ) or try_to_free_pages( ) (see Chapter 17).
The mempool_alloc( )
function fails to allocate a new memory pool element (see the
section "Memory
Pools" in Chapter
8).
Moreover, a pdflush kernel thread executing
the background_writeout( ) callback
function is woken up by every process that modifies the contents of
pages in the page cache and causes the fraction of dirty pages to rise
above some dirty background threshold. The
background threshold is typically set to 10% of all pages in the
system, but its value can be adjusted by writing in the /proc/sys/vm/dirty_background_ratio
file.
The background_writeout( )
function relies on a writeback_control structure, which acts as a
two-way communication device: on one hand, it tells an auxiliary
function called writeback_inodes( )
what to do; on the other hand, it stores some statistics about the
number of pages written to disk. The most important fields of this
structure are the following:
sync_mode
Specifies the synchronization mode: WB_SYNC_ALL means that if a locked
inode is encountered, it must be waited upon and not just
skipped over; WB_SYNC_HOLD
means that locked inodes are put in a list for later
consideration; and WB_SYNC_NONE means that locked inodes
are simply skipped.
bdi
If not NULL, it points
to a backing_dev_info
structure; in this case, only dirty pages belonging to the
underlying block device will be flushed.
older_than_this
If not null, it means that inodes younger than the specified value should be skipped.
nr_to_write
Number of dirty pages yet to be written in this run of execution.
nonblocking
If this flag is set, the process cannot be blocked.
The background_writeout( )
function acts on a single parameter: nr_pages, the minimum number of pages that
should be flushed to disk. It essentially executes the following
steps:
Reads from the page_state
per-CPU variable the number of pages and dirty pages currently
stored in the page cache. If the fraction of dirty pages is below
a given threshold and at least nr_pages have been flushed to disk, the
function terminates. The value of this threshold is typically set
to about 40% of the number of pages in the system; it could be
adjusted by writing into the /proc/sys/vm/dirty_ratio file.
Invokes writeback_inodes( ) to try to write 1,024 dirty pages (see below).
Checks the number of pages effectively written and decreases the number of pages yet to be written.
If less than 1,024 pages have been written or if pages have been skipped, probably the request queue of the block device is congested: the function puts the current process to sleep in a special wait queue for 100 milliseconds or until the queue becomes uncongested.
Goes back to step 1.
The writeback_inodes( )
function acts on a single parameter, namely a pointer wbc to a writeback_control descriptor. The nr_to_write field of this descriptor
contains the number of pages to be flushed to disk. When the function
returns, the same field contains the number of pages remaining to be
flushed; if everything went smoothly, this field will be set to
0.
Let us suppose that writeback_inodes(
) is called with the wbc->bdi and wbc->older_than_this pointers set to
NULL, the WB_SYNC_NONE synchronization mode, and the
wbc->nonblocking flag set—these
are the values set by background_writeout(
). The function scans the list of superblocks rooted at the
super_blocks variable (see the
section "Superblock
Objects" in Chapter
12). The scanning ends when either the whole list has been
traversed, or the target number of pages to be flushed has been
reached. For each superblock sb,
the function executes the following steps:
Checks whether the sb->s_dirty or sb->s_io lists are empty: the first
list collects the dirty inodes of the superblock, while the second
list collects the inodes waiting to be transferred to disk (see
below). If both lists are empty, the inodes on this filesystem
have no dirty pages, so the function considers the next superblock
in the list.
Here the superblock has dirty inodes. Invokes sync_sb_inodes( ) on the sb superblock. This function:
Puts all the inodes of sb->s_dirty into the list pointed
to by sb->s_io and
clears the list of dirty inodes.
Gets the next inode
pointer from sb->s_io.
If this list is empty, it returns.
If the inode was dirtied after sync_sb_inodes( ) started, it skips
the inode's dirty pages and returns. Notice that some dirty
inodes might remain in the sb->s_io list.
If the current process is a pdflush
kernel thread, it checks whether another
pdflush kernel thread running on another
CPU is already trying to flush dirty pages for files belonging
to this block device. This can be done by an atomic test and
set operation on the BDI_pdflush flag of the inode's
backing_dev_info.
Essentially, it is pointless to have more than one
pdflush kernel thread on the same request
queue (see the section "The pdflush Kernel
Threads" earlier in this chapter).
Increases by one the inode's usage counter.
Invokes _
_writeback_single_inode( ) to write back the dirty
buffers associated with the selected inode:
If the inode is locked, it moves inode into the list of dirty
inodes (inode->i_sb->s_dirty) and
returns 0. (Since we are assuming that the wbc->sync_mode field is not
WB_SYNC_ALL, the
function does not block waiting for the inode to
unlock.)
Uses the writepages method of the inode's
address space, or the mpage_writepages( ) function if
no such method exists, to write up to wbc->nr_to_write dirty pages.
This function uses the find_get_pages_tag( ) function
to retrieve quickly all dirty pages in the inode's address
space (see the section "The Tags of the
Radix Tree" earlier in this chapter). Details will
be given in the next chapter.
If the inode is dirty, it uses the superblock's
write_inode method to
write the inode to disk. The functions that implement this
method usually rely on submit_bh(
) to transfer a single block of data (see the
section "Submitting Buffer
Heads to the Generic Block Layer" earlier in this
chapter).
Checks the status of the inode; accordingly, moves
the inode back into the sb->s_dirty list if some page
of the inode is still dirty, or in the inode_unused list if the inode's
reference counter is zero, or in the inode_in_use list otherwise (see
the section "Inode
Objects" in Chapter 12).
Returns the error code of the function invoked in step 2f2.
Back into the sync_sb_inodes(
) function. If the current process is the
pdflush kernel thread, it clears the BDI_pdflush flag set in step
2d.
If some pages were skipped in the inode just processed,
then the inode includes locked buffers: moves all inodes
remaining in the sb->s_io list back into the
sb->s_dirty list: they
will be reconsidered at a later time.
Decreases by one the usage counter of the inode.
If wbc->nr_to_write is greater than
0, goes back to step 2b to look for other dirty inodes of the
same superblock. Otherwise, the sync_sb_inodes( ) function
terminates.
Back into the writeback_inodes(
) function. If wbc->nr_to_write is greater than
zero, it jumps to step 1 and continues with the next superblock in
the global list. Otherwise, it returns.
As stated earlier, the kernel tries to avoid the risk of starvation that occurs when some pages are not flushed for a long period of time. Hence, if a page remains dirty for a predefined amount of time, the kernel explicitly starts an I/O data transfer that writes its contents to disk.
The job of retrieving old dirty pages is delegated to a
pdflush kernel thread that is periodically woken
up. During the kernel initialization, the page_writeback_init( ) function sets up the
wb_timer dynamic timer so that it
decays after dirty_writeback_centisecs hundredths of a
second (usually 500, but this value can be adjusted by writing in the
/proc/sys/vm/dirty_writeback_centisecs
file). The timer function, which is called wb_timer_fn( ), essentially invokes the
pdflush_operation( ) function
passing to it the address of the wb_kupdate(
) callback function.
The wb_kupdate( ) function
walks the page cache looking for "old" dirty inodes; it executes the
following steps:
Invokes the sync_supers(
) function to write the dirty superblocks to disk (see
the next section). Although not strictly related to the flushing
of the pages in the page cache, this invocation ensures that no
superblock remains dirty for more than, usually, five
seconds.
Stores in the older_than_this field of a writeback_control descriptor a pointer
to a value in jiffies corresponding to the current time minus 30
seconds. Thirty seconds is the longest time for which a page is
allowed to remain dirty.
Determines from the per-CPU page_state variable the rough number of
dirty pages currently in the page cache.
Invokes repeatedly writeback_inodes( ) until either the
number of pages written to disk reaches the value determined in
the previous step, or all pages older than 30 seconds have been
written. During this cycle the function might sleep if some
request queue becomes congested.
Uses mod_timer( ) to
restart the wb_timer dynamic
timer: it will decay once again after dirty_writeback_centisecs
hundredths of a second since the invocation of this function (or one
second from now if this execution lasted too long).
In this section, we examine briefly the three system calls available to user applications to flush dirty buffers to disk:
sync( )
Allows a process to flush all dirty buffers to disk
fsync( )
Allows a process to flush all blocks that belong to a specific open file to disk
fdatasync( )
Very similar to fsync( ),
but doesn't flush the inode block of the file
The service routine sys_sync(
) of the sync( )
system call invokes a series of auxiliary
functions:
wakeup_bdflush(0);
sync_inodes(0);
sync_supers( );
sync_filesystems(0);
sync_filesystems(1);
sync_inodes(1);

As described in the previous section, wakeup_bdflush( ) starts a pdflush kernel thread, which flushes to disk all dirty pages contained in the page cache.
The sync_inodes( ) function
scans the list of superblocks looking for dirty inodes to be flushed;
it acts on a wait parameter that
specifies whether it must wait until flushing has been performed or
not. The function scans the superblocks of all currently mounted
filesystems; for each superblock containing dirty inodes, sync_inodes( ) first invokes sync_sb_inodes( ) to flush the corresponding
dirty pages (we described this function earlier in the section "Looking for Dirty Pages To Be
Flushed"), then invokes sync_blockdev(
) to explicitly flush the dirty buffer pages owned by the
block device that includes the superblock. This is done because the
write_inode superblock method of
many disk-based filesystems simply marks the block buffer
corresponding to the disk inode as dirty; the sync_blockdev( ) function makes sure that
the updates made by sync_sb_inodes(
) are effectively written to disk.
The sync_supers( ) function
writes the dirty superblocks to disk, if necessary, by using the
proper write_super superblock
operations. Finally, sync_filesystems(
) executes the sync_fs
superblock method for all writable filesystems. This method is simply
a hook offered to a filesystem in case it needs to perform some
peculiar operation at each sync; this method is only used by
journaling filesystems such as Ext3 (see Chapter 18).
Notice that sync_inodes( )
and sync_filesystems( ) are invoked
twice, once with the wait parameter
equal to 0 and the second time with the parameter equal to 1. This is
done on purpose: first, they quickly flush to disk the unlocked
inodes; next, they wait for each locked inode to become unlocked and
finish writing them one by one.
The fsync( ) system
call forces the kernel to write to disk all dirty buffers that belong
to the file specified by the fd
file descriptor parameter (including the buffer containing its inode,
if necessary). The corresponding service routine derives the address
of the file object and then invokes the fsync method. Usually, this method ends up
invoking the _ _writeback_single_inode(
) function to write back both the dirty pages associated
with the selected inode and the inode itself (see the section "Looking for Dirty Pages To Be
Flushed" earlier in this chapter).
The fdatasync( ) system call
is very similar to fsync( ), but
writes to disk only the buffers that contain the file's data, not
those that contain inode information. Because Linux 2.6 does not have
a specific file method for fdatasync(
), this system call uses the fsync method and is thus identical to
fsync( ).
Accessing a disk-based file is a complex activity that involves the VFS abstraction layer (Chapter 12), handling block devices (Chapter 14), and the use of the page cache (Chapter 15). This chapter shows how the kernel builds on all those facilities to carry out file reads and writes. The topics covered in this chapter apply both to regular files stored in disk-based filesystems and to block device files; these two kinds of files will be referred to simply as "files."
The stage we are working at in this chapter starts after the proper read or write method of a particular file has been called (as described in Chapter 12). We show here how each read ends with the desired data delivered to a User Mode process and how each write ends with data marked ready for transfer to disk. The rest of the transfer is handled by the facilities described in Chapter 14 and Chapter 15.
There are many different ways to access a file. In this chapter we will consider the following cases:
The file is opened with the O_SYNC and O_DIRECT flags cleared, and its content is
accessed by means of the read( )
and write( ) system calls. In
this case, the read( ) system
call blocks the calling process until the data is copied into the
User Mode address space (however, the kernel is always allowed to
return fewer bytes than requested!). The write( ) system call is different, because
it terminates as soon as the data is copied into the page cache
(deferred write). This case is covered in the section "Reading and Writing a
File."
The file is opened with the O_SYNC flag set—or the flag is set at a
later time by the fcntl( )
system call. This flag affects only the write
operation (read operations are always blocking), which blocks the
calling process until the data is effectively written to disk. The
section "Reading and
Writing a File" covers this case, too.
After opening the file, the application issues an mmap( ) system call to map the file into memory. As a result,
the file appears as an array of bytes in RAM, and the application
accesses directly the array elements instead of using read( ) , write( ), or
lseek( ). This case is discussed
in the section "Memory
Mapping."
The file is opened with the O_DIRECT flag set. Any read or write
operation transfers data directly from the User Mode address space
to disk, or vice versa, bypassing the page cache. We discuss this
case in the section "Direct I/O Transfers."
(The values of the O_SYNC and
O_DIRECT flags can be combined in
four meaningful ways.)
The file is accessed—either through a group of POSIX APIs or by means of Linux-specific system calls—in such a way to perform "asynchronous I/O:" this means the requests for data transfers never block the calling process; rather, they are carried on "in the background" while the application continues its normal execution. We discuss this case in the section "Asynchronous I/O."
The section "The read( ) and write( ) System
Calls" in Chapter 12
described how the read( ) and
write( ) system calls are
implemented. The corresponding service routines end up invoking the file
object's read and write methods, which may be
filesystem-dependent. For disk-based filesystems, these methods locate
the physical blocks that contain the data being accessed and activate
the block device driver to start the data transfer.
Reading a file is page-based: the kernel always transfers whole
pages of data at once. If a process issues a read( ) system call to get a few bytes, and
that data is not already in RAM, the kernel allocates a new page frame,
fills the page with the suitable portion of the file, adds the page to
the page cache, and finally copies the requested bytes into the process
address space. For most filesystems, reading a page of data from a file
is just a matter of finding what blocks on disk contain the requested
data. Once this is done, the kernel fills the pages by submitting the
proper I/O operations to the generic block layer. In practice, the
read method of all disk-based
filesystems is implemented by a common function named generic_file_read( ).
Write operations on disk-based files are slightly more complicated
to handle, because the file size could increase, and therefore the
kernel might allocate some physical blocks on the disk. Of course, how
this is precisely done depends on the filesystem type. However, many
disk-based filesystems implement their write methods by means of a common function
named generic_file_write( ). Examples
of such filesystems are Ext2, System V/Coherent/Xenix, and MINIX. On the other hand, several other filesystems, such as
journaling and network filesystems, implement the write
method by means of custom functions.
The generic_file_read(
) function is used to implement the read method for block device files and for
regular files of almost all disk-based filesystems. This function acts
on the following parameters:
filp
Address of the file object
buf
Linear address of the User Mode memory area where the characters read from the file must be stored
count
Number of characters to be read
ppos
Pointer to a variable that stores the offset from which
reading must start (usually the f_pos field of the filp file object)
As a first step, the function initializes two descriptors. The
first descriptor is stored in the local variable local_iov of type iovec; it contains the address (buf) and the length (count) of the User Mode buffer that shall
receive the data read from the file. The second descriptor is stored
in the local variable kiocb of type
kiocb; it is used to keep track of
the completion status of an ongoing synchronous or asynchronous I/O
operation. The main fields of the kiocb descriptor are shown in Table 16-1.
Table 16-1. The main fields of the kiocb descriptor
Type | Field | Description |
|---|---|---|
| struct list_head | ki_run_list | Pointers for the list of I/O operations to be retried later |
| long | ki_flags | Flags of the kiocb descriptor |
| int | ki_users | Usage counter of the kiocb descriptor |
| unsigned int | ki_key | Identifier of the asynchronous I/O operation, or KIOCB_SYNC_KEY for synchronous operations |
| struct file * | ki_filp | Pointer to the file object associated with the ongoing I/O operation |
| struct kioctx * | ki_ctx | Pointer to the asynchronous I/O context descriptor for this operation (see the section "Asynchronous I/O" later in this chapter) |
| int | ki_cancel | Method invoked when canceling an asynchronous I/O operation |
| ssize_t | ki_retry | Method invoked when retrying an asynchronous I/O operation |
| void | ki_dtor | Method invoked when destroying the kiocb descriptor |
| struct list_head | ki_list | Pointers for the list of active ongoing I/O operations on an asynchronous I/O context |
| union | ki_obj | For synchronous operations, pointer to the process descriptor that issued the I/O operation; for asynchronous operations, pointer to the iocb User Mode data structure |
| __u64 | ki_user_data | Value to be returned to the User Mode process |
| loff_t | ki_pos | Current file position of the ongoing I/O operation |
| unsigned short | ki_opcode | Type of operation (read, write, or sync) |
| size_t | ki_nbytes | Number of bytes to be transferred |
| char * | ki_buf | Current position in the User Mode buffer |
| size_t | ki_left | Number of bytes yet to be transferred |
| wait_queue_t | ki_wait | Wait queue used for asynchronous I/O operations |
| void * | private | Freely usable by the filesystem layer |
The generic_file_read( )
function initializes the kiocb
descriptor by executing the init_sync_kiocb macro, which sets the fields
of the object for a synchronous operation. In particular, the macro
sets the ki_key field to KIOCB_SYNC_KEY, the ki_filp field to filp, and the ki_obj field to current.
Then, generic_file_read( )
invokes _ _generic_file_aio_read( )
passing to it the addresses of the iovec and kiocb descriptors just filled. The latter
function returns a value, which is usually the number of bytes
effectively read from the file; generic_file_read( ) terminates by returning
this value.
The _ _generic_file_aio_read(
) function is a general-purpose routine used by all
filesystems to implement both synchronous and asynchronous read
operations. The function receives four parameters: the address
iocb of a kiocb descriptor, the address iov of an array of iovec descriptors, the length of this array,
and the address ppos of a variable
that stores the file's current pointer. When invoked by generic_file_read( ), the array of iovec descriptors is composed of just one
element describing the User Mode buffer that will receive the
data.[*]
We now explain the actions of the _
_generic_file_aio_read( ) function; for the sake of
simplicity, we restrict the description to the most common case: a
synchronous operation raised by a read(
) system call on a page-cached file. Later in this chapter
we describe how this function behaves in other cases. As usual, we do
not discuss how errors and anomalous conditions are handled.
Here are the steps performed by the function:
Invokes access_ok( ) to
verify that the User Mode buffer described by the iovec descriptor is valid. Because the
starting address and length have been received from the sys_read( ) service routine, they must
be checked before using them (see the section "Verifying the
Parameters" in Chapter
10). If the parameters are not valid, returns the -EFAULT error code.
Sets up a read operation descriptor —
namely, a data structure of type read_descriptor_t that stores the
current status of the ongoing file read operation relative to a
single User Mode buffer. The fields of this descriptor are shown
in Table
16-2.
Invokes do_generic_file_read(
), passing to it the file object pointer filp, the pointer to the file offset
ppos, the address of the just
allocated read operation descriptor, and the address of the
file_read_actor( ) function
(see later).
Returns the number of bytes copied into the User Mode
buffer; that is, the value found in the written field of the read_descriptor_t data structure.
Table 16-2. The fields of the read operation descriptor
Type | Field | Description |
|---|---|---|
| size_t | written | How many bytes have been copied into the User Mode buffer |
| size_t | count | How many bytes are yet to be transferred |
| char * | buf | Current position in the User Mode buffer |
| int | error | Error code of the read operation (0 for no error) |
The do_generic_file_read( )
function reads the requested pages from disk and copies them into the
User Mode buffer. In particular, the function performs the following
actions:
Gets the address_space
object corresponding to the file being read; its address is stored
in filp->f_mapping.
Gets the owner of the address_space object, that is, the inode
object that will own the pages to be filled with file's data; its
address is stored in the host
field of the address_space
object. If the file being read is a block device file, the owner
is an inode in the bdev special filesystem rather than the inode pointed to
by filp->f_dentry->d_inode (see
"The address_space
Object" in Chapter
15).
Considers the file as subdivided in pages of data (4,096
bytes per page). The function derives from the file pointer
*ppos the logical number of the
page that includes the first requested byte—that is, the page's
index in the address space—and stores it in the index local variable. The function also
stores in the offset local
variable the displacement inside the page of the first requested
byte.
Starts a cycle to read all pages that include the requested
bytes; the number of bytes to be read is stored in the count field of the read_descriptor_t descriptor. During a
single iteration, the function transfers a page of data by
performing the following substeps:
If index*4096+offset
exceeds the file size stored in the i_size field of the inode object, it
exits from the cycle and goes to step 5.
Invokes cond_resched(
) to check the TIF_NEED_RESCHED flag of the current
process and, if the flag is set, to invoke the schedule( ) function.
If additional pages must be read in advance, it invokes
page_cache_readahead( ) to
read them. We defer discussing read-ahead until the later
section "Read-Ahead of
Files."
Invokes find_get_page(
) passing as parameters a pointer to the address_space object and the value
of index; the function
looks up the page cache to find the descriptor of the page
that stores the requested data, if any.
If find_get_page( )
returned a NULL pointer,
the page requested is not in the page cache. In that case, it
performs the following actions:
Invokes handle_ra_miss(
) to tune the parameters used by the read-ahead
system.
Allocates a new page.
Inserts the descriptor of the new page into the page
cache by invoking add_to_page_cache( ). Remember
that this function sets the PG_locked flag of the new
page.
Inserts the descriptor of the new page into the LRU
list by invoking lru_cache_add(
) (see Chapter 17).
Jumps to step 4j to start reading the file's data.
If the function has reached this point, the page is in
the page cache. Checks the PG_uptodate flag; if it is set, the
data stored in the page is up-to-date, hence there is no need
to read it from disk: jumps to step 4m.
The data on the page is not valid, so it must be read
from disk. The function gains exclusive access to the page by
invoking the lock_page( )
function. As described in the section "Page Cache Handling
Functions" in Chapter 15, lock_page( ) suspends the current
process if the PG_locked
flag is already set, until that bit is cleared.
Now the page is locked by the current process. However,
another process might have removed the page from the page
cache right before the previous step; hence, it checks whether
the mapping field of the
page descriptor is NULL; in
this case, it unlocks the page by invoking unlock_page( ), decreases its usage
counter (it was increased by find_get_page( )), and jumps back to
step 4a starting over with the same page.
If the function has reached this point, the page is
locked and still present in the page cache. Checks the
PG_uptodate flag again,
because another kernel control path could have completed the
necessary read between steps 4f and 4g. If the flag is set, it
invokes unlock_page( ) and
jumps to step 4m to skip the read operation.
Now the actual I/O operation can be started. Invokes the
readpage method of the
address_space object of the
file. The corresponding function takes care of activating the
I/O data transfer from the disk to the page. We discuss later
what this function does for regular files and block device
files.
If the PG_uptodate
flag is still cleared, it waits until the page has been
effectively read by invoking the lock_page( ) function. The page,
which was locked in step 4g, will be unlocked as soon as the
read operation finishes. Therefore, the current process sleeps
until the I/O data transfer terminates.
If index exceeds the
file size in pages (this number is obtained by dividing the
value of the i_size field
of the inode object by 4,096), it decreases the page's usage
counter, and exits from the cycle jumping to step 5. This case
occurs when the file being read is concurrently truncated by
another process.
Stores in the nr
local variable the number of bytes in the page that should be
copied into the User Mode buffer. This value is equal to the
page size (4,096 bytes) unless either offset is not zero—this can happen
only for the first or last page of requested data—or the file
does not contain all requested bytes.
Invokes mark_page_accessed(
) to set the PG_referenced or the PG_active flag, hence denoting the
fact that the page is being used and should not be swapped out
(see Chapter 17).
If the same page (or part thereof) is read several times in
successive executions of do_generic_file_read( ), this step
is executed only during the first read.
Now it is time to copy the data on the page into the
User Mode buffer. To do this, do_generic_file_read( ) invokes the
file_read_actor( )
function, whose address has been passed as a parameter. In
turn, file_read_actor( )
essentially executes the following steps:
Invokes kmap( ),
which establishes a permanent kernel mapping for the page
if it is in high memory (see the section "Kernel Mappings of
High-Memory Page Frames" in Chapter 8).
Invokes _ _copy_to_user(
), which copies the data on the page in the User
Mode address space (see the section "Accessing the
Process Address Space" in Chapter 10). Notice
that this operation might block the process because of
page faults while accessing the User Mode address
space.
Invokes kunmap( )
to release any permanent kernel mapping of the
page.
Updates the count, written, and buf fields of the read_descriptor_t
descriptor.
Updates the index and
offset local variables
according to the number of bytes effectively transferred in
the User Mode buffer. Typically, if the last byte in the page
has been copied into the User Mode buffer, index is increased by one and
offset is set to zero;
otherwise, index is not
increased and offset is set
to the number of bytes in the page that have been copied into
the User Mode buffer.
Decreases the page descriptor usage counter.
If the count field of
the read_descriptor_t
descriptor is not zero, there is other data to be read from
the file: jumps to step 4a to continue the loop with the next
page of data in the file.
All requested—or available—bytes have been read. The
function updates the filp->f_ra read-ahead data structure
to record the fact that data is being read sequentially from the
file (see the later section "Read-Ahead of
Files").
Assigns to *ppos the
value index*4096+offset, thus
storing the next position where a sequential access is to occur
for a future invocation of the read(
) and write( ) system
calls.
Invokes update_atime( )
to store the current time in the i_atime field of the file's inode and to
mark the inode as dirty, and returns.
As we saw, the readpage method is used repeatedly by
do_generic_file_read( ) to read
individual pages from disk into memory.
The readpage method of the
address_space object stores the
address of the function that effectively activates the I/O data
transfer from the physical disk to the page cache. For regular
files, this field typically points to a wrapper that invokes the
mpage_readpage( ) function. For
instance, the readpage method of
the Ext3 filesystem is implemented by the following function:
int ext3_readpage(struct file *file, struct page *page)
{
    return mpage_readpage(page, ext3_get_block);
}
The wrapper is needed because the mpage_readpage( ) function receives as its
parameters the descriptor page of
the page to be filled and the address get_block of a function that helps
mpage_readpage( ) find the right
block. The wrapper is filesystem-specific and can therefore supply
the proper function to get a block. This function translates the
block numbers relative to the beginning of the file into logical
block numbers relative to positions of the block in the disk
partition (for an example, see Chapter 18). Of course, the
latter parameter depends on the type of filesystem to which the
regular file belongs; in the previous example, the parameter is the
address of the ext3_get_block( )
function. The function passed as get_block always uses a buffer head to
store precious information about the block device (b_bdev field), the position of the
requested data on the device (b_blocknr field), and the block status
(b_state field).
The mpage_readpage( )
function chooses between two different strategies when reading a
page from disk. If the blocks that contain the requested data are
contiguously located on disk, then the function submits the read I/O
operation to the generic block layer by using a single bio
descriptor. In the opposite case, each block in the page is read by
using a different bio descriptor. The filesystem-dependent get_block function plays the crucial role
of determining whether the next block in the file is also the next
block on the disk.
Specifically, mpage_readpage(
) performs the following steps:
Checks the PG_private
field of the page descriptor: if it is set, the page is a buffer
page, that is, the page is associated with a list of buffer
heads describing the blocks that compose the page (see the
section "Storing
Blocks in the Page Cache" in Chapter 15). This means
that the page has already been read from disk in the past, and
that the blocks in the page are not adjacent on disk: jumps to
step 11 to read the page one block at a time.
Retrieves the block size (stored in the page->mapping->host->i_blkbits
inode field), and computes two values required to access all
blocks on that page: the number of blocks stored in the page and
the file block number of the first block in the page—that is,
the index of the first block in the page relative to the
beginning of the file.
For each block in the page, invokes the
filesystem-dependent get_block function passed as a
parameter to get the logical block number, that is, the index of
the block relative to the beginning of the disk or partition.
The logical block numbers of all blocks in the page are stored
in a local array.
Checks for any anomalous condition that could occur while
executing the previous step. In particular, if some blocks are
not adjacent on disk, or some block falls inside a "file hole"
(see the section "File Holes" in
Chapter 18), or a
block buffer has been already filled by the get_block function, then jumps to step
11 to read the page one block at a time.
If the function has reached this point, all blocks on the
page are adjacent on disk. However, the page could be the last
page of data in the file, hence some of the blocks in the page
might not have an image on disk. If so, it fills the
corresponding block buffers in the page with zeros; otherwise,
it sets the PG_mappedtodisk
flag of the page descriptor.
Invokes bio_alloc( ) to
allocate a new bio descriptor consisting of a single segment and
to initialize its bi_bdev and
bi_sector fields with the
address of the block device descriptor and the logical block
number of the first block in the page, respectively. Both pieces
of information have been determined in step 3 above.
Sets the bio_vec
descriptor of the bio's segment with the initial address of the
page, the offset of the first byte to be read (zero), and the
total number of bytes to be read.
Stores the address of the mpage_end_io_read( ) function in the
bio->bi_end_io field (see
below).
Invokes submit_bio( ),
which sets the bi_rw flag
with the direction of the data transfer, updates the page_states per-CPU variable to keep
track of the number of read sectors, and invokes the generic_make_request( ) function on
the bio descriptor (see the section "Issuing a Request to the
I/O Scheduler" in Chapter 14).
Returns the value zero (success).
If the function jumps here, the page contains blocks that
are not adjacent on disk. If the page is up-to-date (PG_uptodate flag set), the function
invokes unlock_page( ) to
unlock the page; otherwise, it invokes block_read_full_page( ) to start
reading the page one block at a time (see below).
Returns the value zero (success).
The mpage_end_io_read( )
function is the completion method of the bio; it is executed as soon
as the I/O data transfer terminates. Assuming that there was no I/O
error, the function essentially sets the PG_uptodate flag of the page descriptor,
invokes unlock_page( ) to unlock
the page and to wake up any process sleeping for this event, and
invokes bio_put( ) to destroy the
bio descriptor.
In the sections "VFS Handling of Device
Files" in Chapter
13 and "Opening a Block Device File" in Chapter 14, we discussed how
the kernel handles requests to open a block device file. We saw how
the init_special_inode( )
function sets up the device inode and how the blkdev_open( ) function completes the
opening phase.
Block devices use an address_space object that is stored in the
i_data field of the corresponding
block device inode in the bdev special filesystem. Unlike regular files — whose
readpage method in the address_space object depends on the
filesystem type to which the file belongs — the readpage method of block device files is
always the same. It is implemented by the blkdev_readpage( ) function, which calls
block_read_full_page( ):
int blkdev_readpage(struct file * file, struct page * page)
{
    return block_read_full_page(page, blkdev_get_block);
}
As you can see, the function is once again a wrapper, this
time for the block_read_full_page(
) function. This time the second parameter points to a
function that translates the file block number relative to the
beginning of the file into a logical block number relative to the
beginning of the block device. For block device files, however, the
two numbers coincide; therefore, the blkdev_get_block( ) function performs the
following steps:
Checks whether the number of the first block in the page
exceeds the index of the last block in the block device (this
index is obtained by dividing the size of the block device
stored in bdev->bd_inode->i_size by the
block size stored in bdev->bd_block_size; bdev points to the descriptor of the
block device). If so, it returns -EIO for a write operation, or zero
for a read operation. (Reading beyond the end of a block device
is not allowed, either, but the error code should not be
returned here: the kernel could just be trying to dispatch a
read request for the last data of a block device, and the
corresponding buffer page is only partially mapped.)
Sets the b_bdev field of
the buffer head to bdev.
Sets the b_blocknr
field of the buffer head to the file block number, which was
passed as a parameter of the function.
Sets the BH_Mapped flag
of the buffer head to state that the b_bdev and b_blocknr fields of the buffer head
are significant.
The block_read_full_page( )
function reads a page of data one block at a time. As we have seen,
it is used both when reading block device files and when reading
pages of regular files whose blocks are not adjacent on disk. It
performs the following steps:
Checks the PG_private
flag of the page descriptor; if it is set, the page is
associated with a list of buffer heads describing the blocks
that compose the page (see the section "Storing Blocks in the Page
Cache" in Chapter
15). Otherwise, the function invokes create_empty_buffers( ) to allocate
buffer heads for all block buffers included in the page. The
address of the buffer head for the first buffer in the page is
stored in the page->private field. The b_this_page field of each buffer head
points to the buffer head of the next buffer in the page.
Derives from the file offset relative to the page
(page->index field) the
file block number of the first block in the page.
For each buffer head of the buffers in the page, it performs the following substeps:
If the BH_Uptodate
flag is set, it skips the buffer and continues with the next
buffer in the page.
If the BH_Mapped
flag is not set and the block is not beyond the end of the
file, it invokes the filesystem-dependent get_block function whose address
has been passed as a parameter. For a regular file, the
function looks in the on-disk data structures of the
filesystem and finds the logical block number of the buffer
relative to the beginning of the disk or partition.
Conversely, for a block device file, the function regards
the file block number as the logical block number. In both
cases the function stores the logical block number in the
b_blocknr field of the
corresponding buffer head and sets the BH_Mapped flag.[*]
Tests again the BH_Uptodate flag because the
filesystem-dependent get_block function could have
triggered a block I/O operation that updated the buffer. If
BH_Uptodate is set, it
continues with the next buffer in the page.
Stores the address of the buffer head in the arr local array, and continues
with the next buffer in the page.
If no file hole has been encountered in the previous step,
the function sets the PG_mappedtodisk flag of the
page.
Now the arr local array
stores the addresses of the buffer heads that correspond to the
buffers whose content is not up-to-date. If this array is empty,
all buffers in the page are valid. So the function sets the
PG_uptodate flag of the page
descriptor, unlocks the page by invoking unlock_page( ), and returns.
The arr local array is
not empty. For each buffer head in the array, block_read_full_page( ) performs the
following substeps:
Sets the BH_Lock
flag. If the flag was already set, the function waits until
the buffer is released.
Sets the b_end_io
field of the buffer head to the address of the end_buffer_async_read( ) function
(see below) and sets the BH_Async_Read flag of the buffer
head.
For each buffer head in the arr local array, it invokes the
submit_bh( ) function on it,
specifying the operation type READ. As we saw earlier, this function
triggers the I/O data transfer of the corresponding
block.
Returns 0.
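The per-buffer logic of the steps above can be condensed into a schematic model. This is not kernel source: the struct and the collect_to_read( ) helper are invented for illustration, and the get_block and submit_bh( ) calls are reduced to comments.

```c
#include <assert.h>

#define BLOCKS_PER_PAGE 4

struct toy_bh {
    int uptodate; /* BH_Uptodate */
    int mapped;   /* BH_Mapped   */
};

/* Collect into arr[] the indexes of the buffers that need a READ:
 * up-to-date buffers are skipped, buffers beyond the end of the file
 * are left alone (file holes), the rest are mapped and scheduled.
 * Returns the number of collected buffers. */
static int collect_to_read(struct toy_bh *bh, int nblocks,
                           unsigned long first_block,
                           unsigned long blocks_in_file, int *arr)
{
    int i, n = 0;

    for (i = 0; i < nblocks; i++) {
        if (bh[i].uptodate)
            continue;                     /* already valid: skip       */
        if (first_block + i >= blocks_in_file)
            continue;                     /* beyond end of file        */
        if (!bh[i].mapped)
            bh[i].mapped = 1;             /* get_block would map it    */
        arr[n++] = i;                     /* later: submit_bh(READ)    */
    }
    return n;
}

static int demo_collect(void)
{
    /* block 0 is valid, blocks 1-2 must be read, block 3 is past EOF */
    struct toy_bh bh[BLOCKS_PER_PAGE] = { {1, 1}, {0, 0}, {0, 1}, {0, 0} };
    int arr[BLOCKS_PER_PAGE];

    return collect_to_read(bh, BLOCKS_PER_PAGE, 0, 3, arr);
}
```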
The end_buffer_async_read(
) function is the completion method of the buffer head; it
is executed as soon as the I/O data transfer on the block buffer
terminates. Assuming that there was no I/O error, the function sets
the BH_Uptodate flag of the
buffer head and clears the BH_Async_Read flag. Then, the function
gets the descriptor of the buffer page containing the block buffer
(its address is stored in the b_page field of the buffer head) and
checks whether all blocks in the page are up-to-date; if so, the
function sets the PG_uptodate
flag of the page and invokes unlock_page(
).
Many disk accesses are sequential. As we will see in Chapter 18, regular files are stored on disk in large groups of adjacent sectors, so that they can be retrieved quickly with few moves of the disk heads. When a program reads or copies a file, it often accesses it sequentially, from the first byte to the last one. Therefore, many adjacent sectors on disk are likely to be fetched when handling a series of a process's read requests on the same file.
Read-ahead consists of reading several adjacent pages of data of a regular file or block device file before they are actually requested. In most cases, read-ahead significantly enhances disk performance, because it lets the disk controller handle fewer commands, each of which refers to a larger chunk of adjacent sectors. Moreover, it improves system responsiveness. A process that is sequentially reading a file does not usually have to wait for the requested data because it is already available in RAM.
However, read-ahead is of no use when an application performs random accesses to files; in this case, it is actually detrimental because it tends to waste space in the page cache with useless information. Therefore, the kernel reduces—or stops—read-ahead when it determines that the most recently issued I/O access is not sequential to the previous one.
Read-ahead of files requires a sophisticated algorithm for several reasons:
Because data is read page by page, the read-ahead algorithm does not have to consider the offsets inside the page, but only the positions of the accessed pages inside the file.
Read-ahead may be gradually increased as long as the process keeps accessing the file sequentially.
Read-ahead must be scaled down or even disabled when the current access is not sequential with respect to the previous one (random access).
Read-ahead should be stopped when a process keeps accessing the same pages over and over again (only a small portion of the file is being used), or when almost all pages of the file are already in the page cache.
The low-level I/O device driver should be activated at the proper time, so that the future pages will have been transferred when the process needs them.
The kernel considers a file access as sequential with respect to the previous file access if the first page requested is the page following the last page requested in the previous access.
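A minimal sketch of this test, using the convention of the file_ra_state descriptor that prev_page holds -1 before the first access (the helper name is invented for the sketch):

```c
#include <assert.h>

/* The sequentiality test: an access is sequential if its first page
 * immediately follows the last page of the previous access.
 * prev_page is -1 before the first access, so a first read at file
 * offset zero also counts as sequential. */
static int is_sequential(long prev_page, long offset)
{
    return offset == prev_page + 1;
}
```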
While accessing a given file, the read-ahead algorithm makes use of two sets of pages, each of which corresponds to a contiguous portion of the file. These two sets are called the current window and the ahead window.
The current window consists of pages requested by the process or read in advance by the kernel and included in the page cache. (A page in the current window is not necessarily up-to-date, because its I/O data transfer could be still in progress.) The current window contains both the last pages sequentially accessed by the process and possibly some of the pages that have been read in advance by the kernel but that have not yet been requested by the process.
The ahead window consists of pages—following the ones in the current window—that are currently being read in advance by the kernel. No page in the ahead window has yet been requested by the process, but the kernel assumes that sooner or later the process will request them.
When the kernel recognizes a sequential access and the initial page belongs to the current window, it checks whether the ahead window has already been set up. If not, the kernel creates a new ahead window and triggers the read operations for the corresponding pages. In the ideal case, the process still requests pages from the current window while the pages in the ahead window are being transferred. When the process requests a page included in the ahead window, the ahead window becomes the new current window.
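The window hand-off just described can be sketched as a toy model. The struct and helper names are invented; the real bookkeeping lives in the file_ra_state descriptor, and rebuilding the new ahead window (left at size 0 here) is the job of the read-ahead code.

```c
#include <assert.h>

struct toy_ra {
    unsigned long start, size;             /* current window */
    unsigned long ahead_start, ahead_size; /* ahead window   */
};

/* If the requested page falls inside the ahead window, the ahead window
 * becomes the new current window; the next ahead window will start right
 * after it.  Returns 1 if the hand-off happened. */
static int advance_windows(struct toy_ra *ra, unsigned long page)
{
    if (ra->ahead_size == 0 ||
        page < ra->ahead_start ||
        page >= ra->ahead_start + ra->ahead_size)
        return 0; /* nothing to do */

    ra->start = ra->ahead_start;
    ra->size = ra->ahead_size;
    ra->ahead_start = ra->start + ra->size;
    ra->ahead_size = 0; /* to be refilled by the read-ahead code */
    return 1;
}

static unsigned long demo_advance(void)
{
    struct toy_ra ra = { 0, 8, 8, 16 }; /* current: 0-7, ahead: 8-23 */

    advance_windows(&ra, 10);           /* page 10 is in the ahead window */
    /* encode (start, size, ahead_start) for easy checking */
    return ra.start * 10000 + ra.size * 100 + ra.ahead_start;
}
```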
The main data structure used by the read-ahead algorithm is the
file_ra_state descriptor whose
fields are listed in Table
16-3. Each file object includes such a descriptor in its
f_ra field.
Table 16-3. The fields of the file_ra_state descriptor

| Type | Field | Description |
|---|---|---|
| unsigned long | start | Index of first page in the current window |
| unsigned long | size | Number of pages included in the current window (-1 for read-ahead temporarily disabled, 0 for empty current window) |
| unsigned long | flags | Flags used to control the read-ahead |
| unsigned long | cache_hit | Number of consecutive cache hits (pages requested by the process and found in the page cache) |
| unsigned long | prev_page | Index of the last page requested by the process |
| unsigned long | ahead_start | Index of the first page in the ahead window |
| unsigned long | ahead_size | Number of pages in the ahead window (0 for an empty ahead window) |
| unsigned long | ra_pages | Maximum size in pages of a read-ahead window (0 for read-ahead permanently disabled) |
| unsigned long | mmap_hit | Read-ahead hit counter (used for memory mapped files) |
| unsigned long | mmap_miss | Read-ahead miss counter (used for memory mapped files) |
When a file is opened, all the fields of its file_ra_state descriptor are set to zero
except the prev_page and ra_pages fields.
The prev_page field stores
the index of the last page requested by the process in the previous
read operation; initially, the field contains the value -1.
The ra_pages field represents
the maximum size in pages for the current window, that is, the maximum
read-ahead allowed for the file. The initial (default) value for this
field is stored in the backing_dev_info descriptor of the block
device that includes the file (see the section "Request Queue
Descriptors" in Chapter
14). An application can tune the read-ahead algorithm for a
given opened file by modifying the ra_pages field; this can be done by invoking
the posix_fadvise( ) system call, passing to it the commands POSIX_FADV_NORMAL (set read-ahead maximum
size to default, usually 32 pages), POSIX_FADV_SEQUENTIAL (set read-ahead
maximum size to two times the default), and POSIX_FADV_RANDOM (set read-ahead maximum
size to zero, thus permanently disabling read-ahead).
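From user space, the tuning described above looks like this. A minimal sketch: the temporary file exists only so the example is self-contained, and error handling is reduced to returning the first failure.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Tune the read-ahead window of an open file with posix_fadvise().
 * Returns 0 on success. */
static int fadvise_demo(void)
{
    char path[] = "/tmp/fadviseXXXXXX";
    int fd = mkstemp(path);
    int err;

    if (fd < 0)
        return -1;
    unlink(path); /* the file goes away when fd is closed */

    /* double the default read-ahead maximum size... */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* ...then disable read-ahead, as for a random access pattern */
    if (err == 0)
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    close(fd);
    return err;
}
```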
The flags field contains two
flags called RA_FLAG_MISS and
RA_FLAG_INCACHE that play an
important role. The first flag is set when a page that has been read
in advance is not found in the page cache (likely because it has been
reclaimed by the kernel in order to free memory; see Chapter 17): in this case, the
size of the next ahead window to be created is somewhat reduced. The
second flag is set when the kernel determines that the last 256 pages
requested by the process have all been found in the page cache (the
value of consecutive cache hits is stored in the ra->cache_hit field). In this case,
read-ahead is turned off because the kernel assumes that all the pages
required by the process are already in the cache.
When is the read-ahead algorithm executed? This happens in the following cases:
When the kernel handles a User Mode request to read pages of
file data; this event triggers the invocation of the page_cache_readahead( ) function (see
step 4c in the description of the do_generic_file_read( ) function in the
section "Reading from
a File" earlier in this chapter).
When the kernel allocates a page for a file memory mapping
(see the filemap_nopage( )
function in the section "Demand Paging for Memory
Mapping" later in this chapter, which again invokes the
page_cache_readahead( )
function).
When a User Mode application executes the readahead( ) system call, which explicitly triggers some
read-ahead activity on a file descriptor.
When a User Mode application executes the posix_fadvise( ) system call with the
POSIX_FADV_NOREUSE or POSIX_FADV_WILLNEED commands, which
inform the kernel that a given range of file pages will be
accessed in the near future.
When a User Mode application executes the madvise( ) system call with the MADV_WILLNEED command, which informs the
kernel that a given range of pages in a file memory mapping region
will be accessed in the near future.
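The last of these cases can be exercised with a short program. A sketch under one simplifying assumption: an anonymous mapping is used instead of a file memory mapping so that the example needs no external file; madvise( ) accepts MADV_WILLNEED for both.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hint the kernel that a range of mapped pages will be needed soon.
 * Returns 0 on success. */
static int madvise_demo(void)
{
    size_t len = 4 * 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int err;

    if (p == MAP_FAILED)
        return -1;
    err = madvise(p, len, MADV_WILLNEED); /* prefetch hint */
    munmap(p, len);
    return err;
}
```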
The page_cache_readahead(
) function takes care of all read-ahead operations that
are not explicitly triggered by ad-hoc system calls. It replenishes
the current and ahead windows, updating their sizes according to the
number of read-ahead hits, that is, according to how successful the
read-ahead strategy was in the past accesses to the file.
The function is invoked when the kernel must satisfy a read request for one or more pages of a file, and acts on five parameters:
mapping
Pointer to the address_space object that describes
the owner of the page
ra
Pointer to the file_ra_state descriptor of the file
containing the page
filp
Address of the file object
offset
Offset of the page within the file
req_size
Number of pages yet to be read to complete the current read operation[*]
Figure 16-1
shows the flow diagram of page_cache_readahead( ). The function
essentially acts on the fields of the file_ra_state descriptor; thus, although
the description of the actions in the flow diagram is quite
informal, you can easily determine the actual steps performed by the
function. For instance, in order to check whether the requested page
is the same as the page previously read, the function checks whether
the values of the ra->prev_page field and of the offset parameter coincide (see Table 16-3
earlier).
When the process accesses the file for the first time and the
first requested page is the page at offset zero in the file, the
function assumes that the process will perform sequential accesses.
Thus, the function creates a new current window starting from the
first page. The length of the initial current window—always a power
of two—is somewhat related to the number of pages requested by the
process in the first read operation: the higher the number of
requested pages, the larger the current window, up to the maximum
value stored in the ra->ra_pages field. Conversely, when
the process accesses the file for the first time but the first
requested page is not at offset zero, the function assumes that the
process will not perform sequential accesses. Thus, the function
temporarily disables read-ahead (ra->size field is set to -1). However, a new current window is
created when the function recognizes a sequential access while
read-ahead is temporarily disabled.
If the ahead window does not already exist, it is created as
soon as the function recognizes that the process has performed a
sequential access in the current window. The ahead window always
starts from the page following the last page of the current window.
Its length, however, is related to the length of the current window
as follows: if the RA_FLAG_MISS
flag is set, the length of the ahead window is the length of the
current window minus 2, or four pages if the result is less than
four; otherwise, the length of the ahead window is either four times
or two times the length of the current window. If the process
continues to access the file in a sequential way, eventually the
ahead window becomes the new current window, and a new ahead window
is created. Thus, read-ahead is aggressively enhanced if the process
reads the file sequentially.
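The sizing rules above can be captured in a small helper. This is a simplified model of the behavior described in the text, not the kernel's actual code: where the x4 and x2 growth factors apply, and the cap at the per-file maximum (ra_pages), are assumptions made for the sketch.

```c
#include <assert.h>

/* Simplified model of the ahead-window sizing rules described above.
 * When a read-ahead miss was recorded, shrink: current size minus 2,
 * but never below four pages.  Otherwise grow: a guess of x4 for small
 * windows and x2 for larger ones (the text only says "four times or
 * two times"), capped at the per-file maximum. */
static unsigned long next_ahead_size(unsigned long cur, int miss,
                                     unsigned long max)
{
    unsigned long n;

    if (miss) {
        n = cur >= 2 ? cur - 2 : 0;
        if (n < 4)
            n = 4;
    } else {
        n = cur < max / 16 ? cur * 4 : cur * 2;
    }
    return n > max ? max : n;
}
```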
As soon as the function recognizes a file access that is not sequential with respect to the previous one, the current and ahead windows are cleared (emptied) and the read-ahead is temporarily disabled. Read-ahead is restarted from scratch as soon as the process performs a read operation that is sequential with respect to the previous file access.
Figure 16-1. The flow diagram of the page_cache_readahead( ) function
Every time page_cache_readahead(
) creates a new window, it starts the read operations for
the included pages. In order to read a chunk of pages, page_cache_readahead( ) invokes the
blockable_page_cache_readahead( )
function. To reduce kernel overhead, the latter function adopts the
following clever features:
No reading is performed if the request queue that services the block device is read-congested (it does not make sense to increase congestion and block read-ahead).
The page cache is checked against each page to be read; if the page is already in the page cache, it is simply skipped over.
All the page frames needed by the read request are allocated at once before performing the read from disk. If not all page frames can be obtained, the read-ahead operation is performed only on the available pages. Again, there is little sense in deferring read-ahead until all page frames become available.
Whenever possible, the read operations are submitted to
the generic block layer by using multi-segment bio descriptors
(see the section "Segments" in Chapter 14). This is done
by the specialized readpages
method of the address_space
object, if defined; otherwise, it is done by repeatedly invoking
the readpage method. The
readpage method is described
in the earlier section "Reading from a
File" for the single-segment case only, but it is easy to
adapt the description for the multi-segment case.
In some cases, the kernel must correct the read-ahead
parameters, because the read-ahead strategy does not seem very
effective. Let us consider the do_generic_file_read( ) function described
in the section "Reading
from a File" earlier in this chapter. The page_cache_readahead( ) function is
invoked in step 4c. The flow diagram in Figure 16-1 depicts two
cases: either the requested page is in the current window or in the
ahead window, hence it should have been read in advance, or it is
not, and the function invokes blockable_page_cache_readahead( ) to read
it. In both cases, do_generic_file_read(
) should find the page in the page cache in step 4d. If it
is not found, this means that the page frame reclaiming algorithm
has removed the page from the cache. In this case, do_generic_file_read( ) invokes the
handle_ra_miss( ) function, which
tunes the read-ahead algorithm by setting the RA_FLAG_MISS flag and by clearing the
RA_FLAG_INCACHE flag.
Recall that the write(
) system call involves moving data from the User Mode
address space of the calling process into the kernel data structures,
and then to disk. The write method
of the file object permits each filesystem type to define a
specialized write operation. In Linux 2.6, the write method of each disk-based filesystem
is a procedure that basically identifies the disk blocks involved in
the write operation, copies the data from the User Mode address space
into some pages belonging to the page cache, and marks the buffers in
those pages as dirty.
Many filesystems (including Ext2 or JFS ) implement the write method of the file object by means of
the generic_file_write( ) function,
which acts on the following parameters:
file
File object pointer
buf
Address in the User Mode address space where the characters to be written into the file must be fetched
count
Number of characters to be written
ppos
Address of a variable storing the file offset from which writing must start
The function performs the following steps:
Initializes a local variable of type iovec containing the address and length
of the User Mode buffer (see also the description of the generic_file_read( ) function in the
section "Reading from
a File" earlier in this chapter).
Determines the address inode of the inode object that
corresponds to the file to be written (file->f_mapping->host) and
acquires the semaphore inode->i_sem. Thanks to this
semaphore, only one process at a time can issue a write( ) system call on the file.
Invokes the init_sync_kiocb macro to initialize a
local variable of type kiocb.
As explained in the section "Reading from a File"
earlier in this chapter, the macro sets the ki_key field to KIOCB_SYNC_KEY (synchronous I/O
operation), the ki_filp field
to file, and the ki_obj field to current.
Invokes __generic_file_aio_write_nolock( ) (see below) to mark
the affected pages as dirty, passing the address of the local
variables of type iovec and
kiocb, the number of segments
for the User Mode buffer—only one in this case—and the parameter
ppos.
Releases the inode->i_sem semaphore.
Checks the O_SYNC flag of
the file, the S_SYNC flag of
the inode, and the MS_SYNCHRONOUS flag of the superblock;
if at least one of them is set, it invokes the sync_page_range( ) function to force the
kernel to flush all pages in the page cache that have been touched
in step 4, blocking the current process until the I/O data
transfers terminate. In turn, sync_page_range( ) executes either the
writepages method of the
address_space object, if
defined, or the mpage_writepages(
) function (see the section "Writing Dirty Pages to
Disk" later in this chapter) to start the I/O transfers for
the dirty pages; then, it invokes generic_osync_inode( ) to flush to disk
the inode and the associated buffers, and finally invokes wait_on_page_bit( ) to suspend the
current process until all PG_writeback bits of the flushed pages
are cleared.
Returns the code returned by __generic_file_aio_write_nolock( ),
usually the number of bytes effectively written.
The __generic_file_aio_write_nolock( ) function receives four parameters: the address iocb of a kiocb descriptor, the address iov of an array of iovec descriptors, the length of this array,
and the address ppos of a variable
that stores the file's current pointer. When invoked by generic_file_write( ), the array of iovec descriptors is composed of just one
element describing the User Mode buffer that contains the data to be
written.[*]
We now explain the actions of the __generic_file_aio_write_nolock( ) function;
for the sake of simplicity, we restrict the description to the most
common case: a common mode operation raised by a write( ) system call on a page-cached file.
Later in this chapter we describe how this function behaves in other
cases. As usual, we do not discuss how errors and anomalous conditions
are handled.
The function executes the following steps:
Invokes access_ok( ) to
verify that the User Mode buffer described by the iovec descriptor is valid (the starting
address and length have been received from the sys_write( ) service routine, thus they
must be checked before using them; see the section "Verifying the
Parameters" in Chapter
10). If the parameters are not valid, it returns the
-EFAULT error code.
Determines the address inode of the inode object that
corresponds to the file to be written (file->f_mapping->host). Remember
that if the file is a block device file, this is an inode in the
bdev special filesystem (see Chapter 14).
Sets current->backing_dev_info to the
address of the backing_dev_info
descriptor of the file (file->f_mapping->backing_dev_info).
Essentially, this setting allows the current process to write back
the dirty pages owned by file->f_mapping even if the
corresponding request queue is congested; see Chapter 17.
If the O_APPEND flag of
file->flags is on and the
file is regular (not a block device file), it sets *ppos to the end of the file so that all
new data is appended to it.
Performs several checks on the size of the file. For
instance, the write operation must not enlarge a regular file so
much as to exceed the per-user limit stored in current->signal->rlim[RLIMIT_FSIZE]
(see the section "Process Resource
Limits" in Chapter
3) and the filesystem limit stored in inode->i_sb->s_maxbytes. Moreover,
if the file is not a "large file" (flag O_LARGEFILE of file->f_flags cleared), its size
cannot exceed 2 GB. If any of these constraints is not enforced,
it reduces the number of bytes to be written.
If set, it clears the suid flag of the file; also clears the
sgid flag if the file is
executable (see the section "Access Rights and File
Mode" in Chapter
1). We don't want users to be able to modify
setuid files.
Stores the current time of day in the inode->mtime field (the time of last
file write operation) and in the inode->ctime field (the time of last
inode change), and marks the inode object as dirty.
Starts a cycle to update all the pages of the file involved in the write operation. During each iteration, it performs the following substeps:
Invokes find_lock_page(
) to search the page in the page cache (see the
section "Page
Cache Handling Functions" in Chapter 15). If this
function finds the page, it increases its usage counter and
sets its PG_locked
flag.
If the page is not in the page cache, it allocates a new
page frame and invokes add_to_page_cache( ) to insert the
page into the page cache; as explained in the section "Page Cache Handling
Functions" in Chapter 15, this function
also increases the usage counter and sets the PG_locked flag. Moreover, the
function inserts the new page into the inactive list of the
memory zone (see Chapter
17).
Invokes the prepare_write method of the address_space object of the inode
(file->f_mapping). The
corresponding function takes care of allocating and
initializing buffer heads for the page. We'll discuss in
subsequent sections what this function does for regular files
and block device files.
If the buffer is in high memory, it establishes a kernel
mapping of the User Mode buffer (see the section "Kernel Mappings of
High-Memory Page Frames" in Chapter 8). Then, it
invokes _ _copy_from_user(
) to copy the characters from the User Mode buffer
to the page, and releases the kernel mapping.
Invokes the commit_write method of the address_space object of the inode
(file->f_mapping). The
corresponding function marks the underlying buffers as dirty
so they are written to disk later. We discuss what this
function does for regular files and block device files in the
next two sections.
Invokes unlock_page(
) to clear the PG_locked flag and wake up any
process that is waiting for the page.
Invokes mark_page_accessed(
) to update the page status for the memory
reclaiming algorithm (see the section "The Least Recently Used
(LRU) Lists" in Chapter 17).
Decreases the page usage counter to undo the increment in step 8a or 8b.
In this iteration, yet another page has been dirtied: it
checks whether the ratio of dirty pages in the page cache has
risen above a fixed threshold (usually, 40% of the pages in
the system); if so, it invokes writeback_inodes( ) to start
flushing a few tens of pages to disk (see the section "Looking for Dirty Pages
To Be Flushed" in Chapter 15).
Invokes cond_resched(
) to check the TIF_NEED_RESCHED flag of the current
process and, if the flag is set, to invoke the schedule( ) function.
Now all pages of the file involved in the write operation
have been handled. Updates the value of *ppos to point right after the last
character written.
Sets current->backing_dev_info to NULL (see step 3).
Terminates by returning the number of bytes effectively written.
The prepare_write
and commit_write methods of the
address_space object specialize
the generic write operation implemented by generic_file_write( ) for regular files
and block device files. Both of them are invoked once for every page
of the file that is affected by the write operation.
Each disk-based filesystem defines its own prepare_write method. As with read
operations, this method is simply a wrapper for a common function.
For instance, the Ext2 filesystem usually implements the prepare_write method by means of the
following function:
int ext2_prepare_write(struct file *file, struct page *page,
                       unsigned from, unsigned to)
{
    return block_prepare_write(page, from, to, ext2_get_block);
}
The ext2_get_block( )
function was already mentioned in the earlier section "Reading from a File";
it translates the block number relative to the file into a logical
block number, which represents the position of the data on the
physical block device.
The block_prepare_write( )
function takes care of preparing the buffers and the buffer heads of
the file's page by performing essentially the following
steps:
Checks if the page is a buffer page (flag PG_Private set); if this flag is
cleared, invokes create_empty_buffers(
) to allocate buffer heads for all buffers included in
the page (see the section "Buffer Pages" in
Chapter 15).
For each buffer head relative to a buffer included in the page and affected by the write operation, the following is performed:
Resets the BH_New
flag, if it is set (see below).
If the BH_Mapped
flag is not set, the function performs the following
substeps:
Invokes the filesystem-dependent function whose
address get_block was
passed as a parameter. This function looks in the
on-disk data structures of the filesystem and finds the
logical block number of the buffer (relative to the
beginning of the disk partition rather than the
beginning of the regular file). The filesystem-dependent
function stores this number in the b_blocknr field of the
corresponding buffer head and sets its BH_Mapped flag. The get_block function could
allocate a new physical block for the file (for
instance, if the accessed block falls inside a "hole" of
the regular file; see the section "File
Holes" in Chapter 18). In
this case, it sets the BH_New flag.
Checks the value of the BH_New flag; if it is set,
invokes unmap_underlying_metadata( )
to check whether some block device buffer page in the
page cache includes a buffer referencing the same block
on disk.[*] This function essentially invokes _ _find_get_block( ) to look
up the old block in the page cache (see the section
"Searching
Blocks in the Page Cache" in Chapter 15). If
such a block is found, the function clears its BH_Dirty flag and waits until
any I/O data transfer on that buffer completes.
Moreover, if the write operation does not rewrite the
whole buffer in the page, it fills the unwritten portion
with zeros. Then it considers the next buffer in the
page.
If the write operation does not rewrite the whole
buffer and its BH_Delay
and BH_Uptodate flags are
not set (that is, the block has been allocated in the
on-disk filesystem data structures and the buffer in RAM
does not contain a valid image of the data), the function
invokes ll_rw_block( ) on
the block to read its content from disk (see the section
"Submitting
Buffer Heads to the Generic Block Layer" in Chapter 15).
Blocks the current process until all read operations triggered in step 2c have been completed.
Returns 0.
Once the prepare_write
method returns, the generic_file_write(
) function updates the page with the data stored in the
User Mode address space. Next, it invokes the commit_write method of the address_space object. This method is
implemented by the generic_commit_write(
) function for almost all disk-based non-journaling
filesystems.
The generic_commit_write( )
function performs essentially the following steps:
Invokes the _
_block_commit_write( ) function. In turn, this
function does the following:
Considers all buffers in the page that are affected by
the write operation; for each of them, sets the BH_Uptodate and BH_Dirty flags of the
corresponding buffer head.
Marks the corresponding inode as dirty. As seen in the section "Looking for Dirty Pages To Be Flushed" in Chapter 15, this activity may require adding the inode to the list of dirty inodes of the superblock.
If all buffers in the buffer page are now up-to-date,
it sets the PG_uptodate
flag of the page.
Sets the PG_dirty
flag of the page, and tags the page as dirty in its radix
tree (see the section "The Radix
Tree" in Chapter
15).
Checks whether the write operation enlarged the file. In
this case, the function updates the i_size field of the file's
inode.
Returns 0.
Write operations into block device files are very
similar to the corresponding operations on regular files. In fact,
the prepare_write method of the
address_space object of block
device files is usually implemented by the following
function:
int blkdev_prepare_write(struct file *file, struct page *page,
                         unsigned from, unsigned to)
{
    return block_prepare_write(page, from, to, blkdev_get_block);
}
As you can see, the function is simply a wrapper to the
block_prepare_write( ) function
already discussed in the previous section. The only difference, of
course, is in the second parameter, which points to the function
that must translate the file block number relative to the beginning
of the file to a logical block number relative to the beginning of
the block device. Remember that for block device files, the two
numbers coincide. (See the earlier section "Reading from a File"
for a discussion of the blkdev_get_block(
) function.)
The commit_write method for
block device files is implemented by the following simple wrapper
function:
int blkdev_commit_write(struct file *file, struct page *page,
                        unsigned from, unsigned to)
{
    return block_commit_write(page, from, to);
}
As you can see, the commit_write method for block device files
does essentially the same things as the commit_write method for regular files (we
described the block_commit_write(
) function in the previous section). The only difference
is that the method does not check whether the write operation has
enlarged the file; you simply cannot enlarge a block device file by
appending characters to its last position.
The net effect of the write(
) system call consists of modifying the contents of some
pages in the page cache—optionally allocating the pages and adding
them to the page cache if they were not already present. In some cases
(for instance, if the file has been opened with the O_SYNC flag), the I/O data transfers start
immediately (see step 6 of generic_file_write( ) in the section "Writing to a File"
earlier in this chapter). Usually, however, the I/O data transfer is
delayed, as explained in the section "Writing Dirty Pages to
Disk" in Chapter
15.
When the kernel wants to effectively start the I/O data
transfer, it ends up invoking the writepages method of the file's address_space object, which searches for
dirty pages in the radix-tree and flushes them to disk. For instance,
the Ext2 filesystem implements the writepages method by means of the following
function:
int ext2_writepages(struct address_space *mapping,
                    struct writeback_control *wbc)
{
    return mpage_writepages(mapping, wbc, ext2_get_block);
}
As you can see, this function is a simple wrapper for the
general-purpose mpage_writepages( )
function; as a matter of fact, if a filesystem does not define the
writepages method, the kernel
invokes directly mpage_writepages(
) passing NULL as third
argument. The ext2_get_block( )
function was already mentioned in the earlier section "Reading from a File"; it
is the filesystem-dependent function that translates a file block
number into a logical block number.
The writeback_control data
structure is a descriptor that controls how the writeback operation
has to be performed; we have already described it in the section
"Looking for Dirty Pages
To Be Flushed" in Chapter
15.
The mpage_writepages( )
function essentially performs the following actions:
If the request queue is write-congested and the process does not want to block, it returns without writing any page to disk.
Determines the file's initial page to be considered. If the
writeback_control descriptor
specifies the initial position in the file, the function
translates it into a page index. Otherwise, if the writeback_control descriptor specifies
that the process does not want to wait for the I/O data transfer
to complete, it sets the initial page index to the value stored in
mapping->writeback_index
(that is, scanning begins from the last page considered in the
previous writeback operation). Finally, if the process must wait
until I/O data transfers complete, scanning starts from the first
page of the file.
Invokes find_get_pages_tag(
) to look up the descriptor of the dirty pages in the
page cache (see the section "The Tags of the Radix
Tree" in Chapter
15).
For each page descriptor retrieved in the previous step, the function performs the following steps:
Invokes lock_page( )
to lock up the page.
Checks that the page is still valid and in the page cache (because another kernel control path could have acted upon the page between steps 3 and 4a).
Checks the PG_writeback flag of the page. If it
is set, the page is already being flushed to disk. If the
process must wait for the I/O data transfer to complete, it
invokes wait_on_page_bit( )
to block the current process until the PG_writeback flag is cleared; when
this function terminates, any previously ongoing writeback
operation is terminated. Otherwise, if the process does not
want to wait, it checks the PG_dirty flag: if it is now cleared,
the ongoing writeback will take care of the page, so the function
unlocks it and jumps back to step 4a to continue with the next
page.
If the get_block
parameter is NULL (no
writepages method defined),
it invokes the mapping->writepage method of the
address_space object of the
file to flush the page to disk. Otherwise, if the get_block parameter is not NULL, it invokes the mpage_writepage( ) function. See
step 8 for details.
Invokes cond_resched( )
to check the TIF_NEED_RESCHED
flag of the current process and, if the flag is set, to invoke the
schedule( ) function.
If the function has not scanned all pages in the given
range, or if the number of pages effectively written to disk is
smaller than the value originally specified in the writeback_control descriptor, it jumps
back to step 3.
If the writeback_control
descriptor does not specify the initial position in the file, it
sets the mapping->writeback_index field with
the index of the last scanned page.
If the mpage_writepage( )
function has been invoked in step 4d, and if that function
returned the address of a bio descriptor, it invokes mpage_bio_submit( ) (see below).
A typical filesystem such as Ext2 implements the writepage method as a wrapper for the
general-purpose block_write_full_page(
) function, passing to it the address of the
filesystem-dependent get_block
function. In turn, the block_write_full_page(
) function is similar to block_read_full_page( ) described in the
section "Reading from a
File" earlier in this chapter: it allocates buffer heads for
the page (if the page was not already a buffer page), and invokes the
submit_bh( ) function on each of
them, specifying the WRITE
operation. As far as block device files are concerned, they implement
the writepage method by using
blkdev_writepage( ), which is a
wrapper for block_write_full_page(
).
Many non-journaling filesystems rely on the mpage_writepage( ) function rather than on
the custom writepage method. This
can improve performance because the mpage_writepage( ) function tries to submit
the I/O transfers by collecting as many pages as possible in the same
bio descriptor; in turn, this allows the block device drivers to
exploit the scatter-gather DMA capabilities of the modern hard disk
controllers.
To make a long story short, the mpage_writepage( ) function checks whether
the page to be written contains blocks that are not adjacent on disk,
or whether the page includes a file hole, or whether some block on the
page is not dirty or not up-to-date. If at least one of these
conditions holds, the function falls back on the filesystem-dependent
writepage method, as above.
Otherwise, the function adds the page as a segment of a bio
descriptor. The address of the bio descriptor is passed as parameter
to the function; if it is NULL,
mpage_writepage( ) initializes a
new bio descriptor and returns its address to the calling function,
which in turn passes it back in the future invocations of mpage_writepage( ). In this way, several
pages can be added to the same bio. If a page is not adjacent to the
last added page in the bio, mpage_writepage(
) invokes mpage_bio_submit(
) to start the I/O data transfer on the bio, and allocates a
new bio for the page.
The mpage_bio_submit( )
function sets the bi_end_io method
of the bio to the address of mpage_end_io_write( ), then invokes submit_bio( ) to start the transfer (see the
section "Submitting
Buffer Heads to the Generic Block Layer" in Chapter 15). Once the data
transfer successfully terminates, the completion function mpage_end_io_write( ) wakes up any process
waiting for the page transfer to complete, and destroys the bio
descriptor.
[*] A variant of the read( )
system call—named readv( )
—allows an application to define multiple User Mode
buffers in which the kernel scatters the data read from the file;
the _ _generic_file_aio_read( )
function handles this case, too. In the following, we will assume
that the data read from the file will be copied into just one User
Mode buffer; however, guessing the additional steps to be
performed when using multiple buffers is straightforward.
[*] When accessing a regular file, the get_block function might not
find the block if it falls in a "file hole" (see the
section "File
Holes" in Chapter 18). In
this case, the function fills the block buffer with
zeros and sets the BH_Uptodate flag of the buffer
head.
[*] Actually, if the read operation involves a number of
pages larger than the maximum size of the read-ahead
window, the page_cache_readahead(
) function is invoked several times. Thus, the
req_size parameter
might be smaller than the number of pages yet to be read
to complete the read operation.
[*] A variant of the write( )
system call—named writev( )
—allows an application to define multiple User Mode
buffers from which the kernel fetches the data to be written on
the file; the generic_file_aio_write_nolock( )
function handles this case too. In the following pages, we will
assume that the data will be fetched from just one User Mode
buffer; however, guessing the additional steps to be performed
when using multiple buffers is straightforward.
As already mentioned in the section "Memory Regions" in Chapter 9, a memory region can be associated with some portion of either a regular file in a disk-based filesystem or a block device file. This means that an access to a byte within a page of the memory region is translated by the kernel into an operation on the corresponding byte of the file. This technique is called memory mapping.
Two kinds of memory mapping exist:
Shared

Each write operation on the pages of the memory region changes the file on disk; moreover, if a process writes into a page of a shared memory mapping, the changes are visible to all other processes that map the same file.
Private

Meant to be used when the process creates the mapping just to read the file, not to write it. For this purpose, private mapping is more efficient than shared mapping. But each write operation on a privately mapped page will cause it to stop mapping the page in the file. Thus, a write does not change the file on disk, nor is the change visible to any other processes that access the same file. However, pages of a private memory mapping that have not been modified by the process are affected by file updates performed by other processes.
A process can create a new memory mapping by issuing an mmap( ) system call (see the section "Creating a Memory Mapping"
later in this chapter). Programmers must specify either the MAP_SHARED flag or the MAP_PRIVATE flag as a parameter of the system
call; as you can easily guess, in the former case the mapping is shared,
while in the latter it is private. Once the mapping is created, the
process can read the data stored in the file by simply reading from the
memory locations of the new memory region. If the memory mapping is
shared, the process can also modify the corresponding file by simply
writing into the same memory locations. To destroy or shrink a memory
mapping, the process may use the munmap(
) system call (see the later section "Destroying a Memory
Mapping").
As a general rule, if a memory mapping is shared, the
corresponding memory region has the VM_SHARED flag set; if it is private, the
VM_SHARED flag is cleared. As we'll
see later, an exception to this rule exists for read-only shared memory
mappings.
A memory mapping is represented by a combination of the following data structures:
The inode object associated with the mapped file
The address_space object
of the mapped file
A file object for each different mapping performed on the file by different processes
A vm_area_struct
descriptor for each different mapping on the file
A page descriptor for each page frame assigned to a memory region that maps the file
Figure 16-2
illustrates how the data structures are linked. On the left side of
the image we show the inode, which identifies the file. The i_mapping field of each inode object points
to the address_space object of the
file. In turn, the page_tree field
of each address_space object points
to the radix tree of pages belonging to the address space (see the
section "The Radix
Tree" in Chapter
15), while the i_mmap field
points to a second tree called the radix priority search tree (PST) of
memory regions belonging to the address space. The main use of PST is
for performing "reverse mapping," that is, for identifying quickly all
processes that share a given page. We'll cover PSTs in detail in the
next chapter, because they are used for page frame reclaiming. The
link between file objects relative to the same file and the inode is
established by means of the f_mapping field.
Each memory region descriptor has a vm_file field that links it to the file
object of the mapped file (if that field is null, the memory region is
not used in a memory mapping). The position of the first mapped
location is stored into the vm_pgoff field of the memory region
descriptor; it represents the file offset as a number of page-size
units. The length of the mapped file portion is simply the length of
the memory region, which can be computed from the vm_start and vm_end fields.
Pages of shared memory mappings are always included in the page cache; pages of private memory mappings are included in the page cache as long as they are unmodified. When a process tries to modify a page of a private memory mapping, the kernel duplicates the page frame and replaces the original page frame with the duplicate in the process Page Table; this is one of the applications of the Copy On Write mechanism that we discussed in Chapter 8. The original page frame still remains in the page cache, although it no longer belongs to the memory mapping since it is replaced by the duplicate. In turn, the duplicate is not inserted into the page cache because it no longer contains valid data representing the file on disk.
Figure 16-2 also shows a few page descriptors of pages included in the page cache that refer to the memory-mapped file. Notice that the first memory region in the figure is three pages long, but only two page frames are allocated for it; presumably, the process owning the memory region has never accessed the third page.
The kernel offers several hooks to customize the memory mapping
mechanism for every different filesystem. The core of memory mapping
implementation is delegated to a file object's method named mmap. For most disk-based filesystems and
for block device files, this method is implemented by a general
function called generic_file_mmap(
), which is described in the next section.
File memory mapping depends on the demand paging mechanism described in the section "Demand Paging" in Chapter 9. In fact, a newly
established memory mapping is a memory region that doesn't include any
page; as the process references an address inside the region, a Page
Fault occurs and the Page Fault handler checks whether the nopage method of the memory region is
defined. If nopage is not defined,
the memory region doesn't map a file on disk; otherwise, it does, and
the method takes care of reading the page by accessing the block
device. Almost all disk-based filesystems and block device files
implement the nopage method by
means of the filemap_nopage( )
function.
To create a new memory mapping, a process issues an
mmap( ) system call, passing the
following parameters to it:
A file descriptor identifying the file to be mapped.
An offset inside the file specifying the first character of the file portion to be mapped.
The length of the file portion to be mapped.
A set of flags. The process must explicitly set either the
MAP_SHARED flag or the MAP_PRIVATE flag to specify the kind of
memory mapping requested.[*]
A set of permissions specifying one or more types of access
to the memory region: read access (PROT_READ), write access (PROT_WRITE), or execution access
(PROT_EXEC).
An optional linear address, which is taken by the kernel as
a hint of where the new memory region should start. If the
MAP_FIXED flag is specified and
the kernel cannot allocate the new memory region starting from the
specified linear address, the system call fails.
The mmap( ) system call
returns the linear address of the first location in the new memory
region. For compatibility reasons, in the 80 × 86 architecture, the
kernel reserves two entries in the system call table for mmap( ) : one at index 90 and the other at index 192. The
former entry corresponds to the old_mmap(
) service routine (used by older C libraries), while the
latter one corresponds to the sys_mmap2(
) service routine (used by recent C libraries). The two
service routines differ only in how the six parameters of the system
call are passed. Both of them end up invoking the do_mmap_pgoff( ) function described in the
section "Allocating a
Linear Address Interval" in Chapter 9. We now complete that
description by detailing the steps performed only when creating a
memory region that maps a file. We thus describe the case where the
file parameter (pointer to a file
object) of do_mmap_pgoff( ) is
non-null. For the sake of clarity, we refer to the enumeration used to
describe do_mmap_pgoff( ) and point
out the additional steps performed under the new condition.
Checks whether the mmap
file operation for the file to be mapped is defined; if not, it
returns an error code. A NULL
value for mmap in the file
operation table indicates that the corresponding file cannot be
mapped (for instance, because it is a directory).
The get_unmapped_area(
) function invokes the get_unmapped_area method of the file
object, if it is defined, so as to allocate an interval of
linear addresses suitable for the memory mapping of the file.
The disk-based filesystems do not define this method; in this
case, as explained in the section "Memory Region
Handling" in Chapter
9, the get_unmapped_area(
) function ends up invoking the get_unmapped_area method of the memory
descriptor.
In addition to the usual consistency checks, it compares
the kind of memory mapping requested (stored in the flags parameter of the mmap( ) system call) and the flags
specified when the file was opened (stored in the file->f_mode field). In
particular:
If a shared writable memory mapping is required, it
checks that the file was opened for writing and that it was
not opened in append mode (O_APPEND flag of the open( ) system call).
If a shared memory mapping is required, it checks that there is no mandatory lock on the file (see the section "File Locking" in Chapter 12).
For every kind of memory mapping, it checks that the file was opened for reading.
If any of these conditions is not fulfilled, an error code is returned.
Moreover, when initializing the value of the vm_flags field of the new memory
region descriptor, it sets the VM_READ, VM_WRITE, VM_EXEC, VM_SHARED, VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, and VM_MAYSHARE flags according to the
access rights of the file and the kind of requested memory
mapping (see the section "Memory Region Access
Rights" in Chapter
9). As an optimization, the VM_SHARED and VM_MAYWRITE flags are cleared for
nonwritable shared memory mapping. This can be done because the
process is not allowed to write into the pages of the memory
region, so the mapping is treated the same as a private mapping;
however, the kernel actually allows other processes that share
the file to read the pages in this memory region.
Initializes the vm_file
field of the memory region descriptor with the address of the
file object and increases the file's usage counter. Invokes the
mmap method for the file
being mapped, passing as parameters the address of the file
object and the address of the memory region descriptor. For most
filesystems, this method is implemented by the generic_file_mmap( ) function, which
performs the following operations:
Stores the current time in the i_atime field of the file's inode
and marks the inode as dirty.
Initializes the vm_ops field of the memory region
descriptor with the address of the generic_file_vm_ops table. All
methods in this table are null, except the nopage method, which is
implemented by the filemap_nopage(
) function, and the populate method, which is
implemented by the filemap_populate( ) function (see
"Non-Linear
Memory Mappings" later in this chapter).
Increases the i_writecount field of the file's
inode, that is, the usage counter for writing processes.
When a process is ready to destroy a memory mapping, it
invokes munmap( ); this system call
can also be used to reduce the size of each kind of memory region. The
parameters used are:
The address of the first location in the linear address interval to be removed.
The length of the linear address interval to be removed.
The sys_munmap( ) service
routine of the system call essentially invokes the do_munmap( ) function already described in
the section "Releasing a
Linear Address Interval" in Chapter 9. Notice that there is no
need to flush to disk the contents of the pages included in a writable
shared memory mapping to be destroyed. In fact, these pages continue
to act as a disk cache because they are still included in the page
cache.
For reasons of efficiency, page frames are not assigned to a memory mapping right after it has been created, but at the last possible moment—that is, when the process attempts to address one of its pages, thus causing a Page Fault exception.
We saw in the section "Page Fault Exception
Handler" in Chapter 9
how the kernel verifies whether the faulty address is included in some
memory region of the process; if so, the kernel checks the Page Table
entry corresponding to the faulty address and invokes the do_no_page( ) function if the entry is null
(see the section "Demand
Paging" in Chapter
9).
The do_no_page( ) function
performs all the operations that are common to all types of demand
paging, such as allocating a page frame and updating the Page Tables.
It also checks whether the nopage
method of the memory region involved is defined. In the section "Demand Paging" in Chapter 9, we described the case
in which the method is undefined (anonymous memory region); now we
complete the description by discussing the main actions performed by
the function when the method is defined:
Invokes the nopage
method, which returns the address of a page frame that contains
the requested page.
If the process is trying to write into the page and the
memory mapping is private, it avoids a future Copy On Write fault
by making a copy of the page just read and inserting it into the
inactive list of pages (see Chapter 17). If the private
memory mapping region does not already have a slave anonymous
memory region that includes the new page, it either adds a new
slave anonymous memory region or extends an existing one (see the
section "Memory
Regions" in Chapter
9). In the following steps, the function uses the new page
instead of the page returned by the nopage method, so that the latter is not
modified by the User Mode process.
If some other process has truncated or invalidated the page
(the truncate_count field of
the address_space descriptor is
used for this kind of check), the function retries getting the
page by jumping back to step 1.
Increases the rss field
of the process memory descriptor to indicate that a new page frame
has been assigned to the process.
Sets up the Page Table entry corresponding to the faulty
address with the address of the page frame and the page access
rights included in the memory region vm_page_prot field.
If the process is trying to write into the page, it forces
the Read/Write and Dirty bits of the Page Table entry to 1.
In this case, either the page frame is exclusively assigned to the
process, or the page is shared; in both cases, writing to it
should be allowed.
The core of the demand paging algorithm consists of the memory
region's nopage method. Generally
speaking, it must return the address of a page frame that contains the
page accessed by the process. Its implementation depends on the kind
of memory region in which the page is included.
When handling memory regions that map files on disk, the
nopage method must first search for
the requested page in the page cache. If the page is not found, the
method must read it from disk. Most filesystems implement the nopage method by means of the filemap_nopage( ) function, which receives
three parameters:
area
Descriptor address of the memory region, including the required page
address
Linear address of the required page
type
Pointer to a variable in which the function writes the
type of page fault detected by the function (VM_FAULT_MAJOR or VM_FAULT_MINOR)
The filemap_nopage( )
function executes the following steps:
Gets the file object address file from the area->vm_file field. Derives the
address_space object address
from file->f_mapping.
Derives the inode object address from the host field of the address_space object.
Uses the vm_start and
vm_pgoff fields of area to determine the offset within the
file of the data corresponding to the page starting from address.
Checks whether the file offset exceeds the file size. When
this happens, it returns NULL,
which means failure in allocating the new page, unless the Page
Fault was caused by a debugger tracing another process through the
ptrace( ) system call. We are not going to discuss this
special case.
If the VM_RAND_READ flag
of the memory region is set (see below), we may assume that the
process is reading the pages of the memory mapping in a random
way. In this case, it ignores read-ahead by jumping to step
10.
If the VM_SEQ_READ flag
of the memory region is set (see below), we may assume that the
process is reading the pages of the memory mapping in a strictly
sequential way. In this case, it invokes page_cache_readahead( ) to perform
read-ahead starting from the faulty page (see the section "Read-Ahead of Files"
earlier in this chapter).
Invokes find_get_page( )
to look in the page cache for the page identified by the address_space object and the file
offset. If the page is found, it jumps to step 11.
If the function has reached this point, the page has not
been found in the page cache. Checks the VM_SEQ_READ flag of the memory
region:
If the flag is set, the kernel is aggressively reading
in advance the pages of the memory region, hence the
read-ahead algorithm has failed: it invokes handle_ra_miss( ) to tune up the
read-ahead parameters (see the section "Read-Ahead of
Files" earlier in this chapter), then jumps to step
10.
Otherwise, if the flag is clear, it increases by one the
mmap_miss counter in the
file_ra_state descriptor of
the file. If the number of misses is much larger than the
number of hits (stored in the mmap_hit counter), it ignores
read-ahead by jumping to step 10.
If read-ahead is not permanently disabled (ra_pages field in the file_ra_state descriptor greater than
zero), it invokes do_page_cache_readahead( ) to read a set
of pages surrounding the requested page.
Invokes find_get_page( )
to check whether the requested page is in the page cache; if it is
there, jumps to step 11.
Invokes page_cache_read(
). This function checks whether the requested page is
already in the page cache and, if it is not there, allocates a new
page frame, adds it to the page cache, and executes the mapping->a_ops->readpage method to
schedule an I/O operation that reads the page's contents from
disk.
Invokes the grab_swap_token(
) function to possibly assign the swap token to the
current process (see the section "The Swap Token" in
Chapter 17).
The requested page is now in the page cache. Increases by
one the mmap_hit counter of the
file_ra_state descriptor of the
file.
If the page is not up-to-date (PG_uptodate flag clear), it invokes
lock_page( ) to lock up the
page, executes the mapping->a_ops->readpage method to
trigger the I/O data transfer, and invokes wait_on_page_bit( ) to sleep until the
page is unlocked—that is, until the data transfer
completes.
Invokes mark_page_accessed(
) to mark the requested page as accessed (see next
chapter).
If an up-to-date version of the page was found in the page
cache, it sets *type to
VM_FAULT_MINOR; otherwise sets
it to VM_FAULT_MAJOR.
Returns the address of the requested page.
A User Mode process can tailor the read-ahead behavior of the
filemap_nopage( ) function by using
the madvise( ) system call. The MADV_RANDOM command sets the VM_RAND_READ flag of the memory region to
specify that the pages of the memory region will be accessed in random
order; the MADV_SEQUENTIAL command
sets the VM_SEQ_READ flag to
specify that the pages will be accessed in strictly sequential order;
finally, the MADV_NORMAL command
resets both the VM_RAND_READ and
VM_SEQ_READ flags to specify that
the pages will be accessed in an unspecified order.
The msync( ) system
call can be used by a process to flush to disk dirty pages belonging
to a shared memory mapping. It receives as its parameters the starting
address of an interval of linear addresses, the length of the
interval, and a set of flags that have the following meanings:
MS_SYNC
Asks the system call to suspend the process until the I/O operation completes. In this way, the calling process can assume that when the system call terminates, all pages of its memory mapping have been flushed to disk.
MS_ASYNC (complement of MS_SYNC)
Asks the system call to return immediately without suspending the calling process.
MS_INVALIDATE
Asks the system call to invalidate other memory mappings of the same file (not really implemented, because useless in Linux).
The sys_msync( ) service
routine invokes msync_interval( )
on each memory region included in the interval of linear addresses. In
turn, the latter function performs the following operations:
If the vm_file field of
the memory region descriptor is NULL, or if the VM_SHARED flag is clear, it returns 0
(the memory region is not a writable shared memory mapping of a
file).
Invokes the filemap_sync(
) function, which scans the Page Table entries
corresponding to the linear address intervals included in the
memory region. For each page found, it resets the Dirty flag in the corresponding page
table entry and invokes flush_tlb_page(
) to flush the corresponding translation lookaside
buffers; then, it sets the PG_dirty flag in the page descriptor to
mark the page as dirty.
If the MS_ASYNC flag is
set, it returns. Therefore, the practical effect of the MS_ASYNC flag consists of setting the
PG_dirty flags of the pages in
the memory region; the system call does not actually start the I/O
data transfers.
If the function has reached this point, the MS_SYNC flag is set, hence the function
must flush the pages in the memory region to disk and put the
current process to sleep until all I/O data transfers terminate.
In order to do this, the function acquires the i_sem semaphore of the file's
inode.
Invokes the filemap_fdatawrite(
) function, which receives the address of the file's
address_space object. This
function essentially sets up a writeback_control descriptor with the
WB_SYNC_ALL synchronization
mode, and checks whether the address space has a built-in writepages method. If so, it invokes the
corresponding function and returns. In the opposite case, it
executes the mpage_writepages(
) function. (See the section "Writing Dirty Pages to
Disk" earlier in this chapter.)
Checks whether the fsync
method of the file object is defined; if so, executes it. For
regular files, this method usually limits itself to flushing the
inode object of the file to disk. For block device files, however,
the method invokes sync_blockdev(
), which activates the I/O data transfer of all dirty
buffers of the device.
Executes the filemap_fdatawait(
) function. We recall from the section "The Tags of the Radix
Tree" in Chapter
15 that a radix tree in the page cache identifies all pages
that are currently being written to disk by means of the PAGECACHE_TAG_WRITEBACK tag. The
function quickly scans the portion of the radix tree that covers
the given interval of linear addresses looking for pages having
the PG_writeback flag set; for
each such page, the function invokes wait_on_page_bit( ) to sleep until the
PG_writeback flag is cleared —
that is, until the ongoing I/O data transfer on the page
terminates.
Releases the i_sem
semaphore of the file and returns.
The Linux 2.6 kernel offers yet another kind of access method for regular files: the non-linear memory mappings. Basically, a non-linear memory mapping is a file memory mapping as described previously, but its memory pages are not mapped to sequential pages of the file; rather, each memory page maps a random (arbitrary) page of the file's data.
Of course, a User Mode application might achieve the same result
by invoking the mmap( ) system call repeatedly, each time on a different
4096-byte-long portion of the file. However, this approach is not very
efficient for non-linear mapping of large files, because each mapping
page requires its own memory region.
In order to support non-linear memory mapping, the kernel makes
use of a few additional data structures. First of all, the VM_NONLINEAR flag of the memory region
descriptor specifies that the memory region contains a non-linear
mapping. All descriptors of non-linear mapping memory regions for a
given file are collected in a doubly linked circular list rooted at
the i_mmap_nonlinear field of the
address_space object.
To create a non-linear memory mapping, the User Mode application
first creates a normal shared memory mapping with the mmap( ) system call. Then, the application
remaps some of the pages in the memory mapping region by invoking
remap_file_pages( ). The sys_remap_file_pages( ) service routine of
the system call receives four parameters:
start
A linear address inside a shared file memory mapping region of the calling process
size
Size of the remapped portion of the file in bytes
prot
Unused (must be zero)
pgoff
Page index of the initial file's page to be remapped
flags
Flags controlling the non-linear memory mapping
The service routine remaps the portion of the file's data
identified by the pgoff and
size parameters starting from the
start linear address. If either the
memory region is not shared or it is not large enough to include all
the pages requested for the mapping, the system call fails and an
error code is returned. Essentially, the service routine inserts the
memory region in the i_mmap_nonlinear list of the file and
invokes the populate method of the
memory region.
For all regular files, the populate method is implemented by the
filemap_populate( ) function, which
executes the following steps:
Checks whether the MAP_NONBLOCK flag in the flags parameter
of the remap_file_pages( )
system call is clear; if so, it invokes do_page_cache_readahead( ) to read in
advance the pages of the file to be remapped.
For each page to be remapped, performs the following substeps:
Checks whether the page descriptor is already included
in the page cache; if it is not there and the MAP_NONBLOCK flag is cleared, it
reads the page from disk.
If the page descriptor is in the page cache, it updates the Page Table entry of the corresponding linear address so that it points to the page frame, and updates the counter of pages in the memory region descriptor.
Otherwise, if the page descriptor has not been found in
the page cache, it stores the offset of the file's page in the
32 highest-order bits of the Page Table entry for the
corresponding linear address; also, clears the Present bit of the Page Table entry
and sets the Dirty
bit.
As explained in the section "Demand Paging" in Chapter 9, when handling a
demand-paging fault the handle_ pte_fault(
) function checks the Present and Dirty bits in the Page Table entry; if they
have the values corresponding to a non-linear memory mapping, handle_pte_fault( ) invokes the do_file_page( ) function, which extracts the
index of the requested file's page from the high-order bits of the
Page Table entry; then, do_file_page(
) invokes the populate
method of the memory region to read the page from disk and update the
Page Table entry itself.
Because the memory pages of a non-linear memory mapping are included in the page cache according to the page index relative to the beginning of the file—rather than the index relative to the beginning of the memory region—non-linear memory mappings are flushed to disk exactly like linear memory mappings (see the section "Flushing Dirty Memory Mapping Pages to Disk" earlier in this chapter).
[*] The process could also set the MAP_ANONYMOUS flag to specify that
the new memory region is anonymous — that is, not associated
with any disk-based file (see the section "Demand Paging" in
Chapter 9). A
process can also create a memory region that is both MAP_SHARED and MAP_ANONYMOUS: in this case, the
region maps a special file in the tmpfs
filesystem (see the section "IPC Shared
Memory" in Chapter
19), which can be accessed by all the process's
descendants.
As we have seen, in Version 2.6 of Linux, there is no substantial difference between accessing a regular file through the filesystem, accessing it by referencing its blocks on the underlying block device file, or even establishing a file memory mapping. There are, however, some highly sophisticated programs (self-caching applications ) that would like to have full control of the whole I/O data transfer mechanism. Consider, for example, high-performance database servers: most of them implement their own caching mechanisms that exploit the peculiar nature of the queries to the database. For these kinds of programs, the kernel page cache doesn't help; on the contrary, it is detrimental for the following reasons:
Lots of page frames are wasted to duplicate disk data already in RAM (in the user-level disk cache).
The read( ) and write( ) system calls are slowed down by
the redundant instructions that handle the page cache and the
read-ahead; ditto for the paging operations related to the file
memory mappings.
Rather than transferring the data directly between the disk
and the user memory, the read( )
and write( ) system calls make
two transfers: between the disk and a kernel buffer and between the
kernel buffer and the user memory.
Because block hardware devices must be handled through interrupts and Direct Memory Access (DMA), and this can be done only in Kernel Mode, some sort of kernel support is definitely required to implement self-caching applications.
Linux offers a simple way to bypass the page cache: direct I/O transfers. In each I/O direct transfer, the kernel programs the disk controller to transfer the data directly from/to pages belonging to the User Mode address space of a self-caching application.
As we know, each data transfer proceeds asynchronously. While it is in progress, the kernel may switch the current process, the CPU may return to User Mode, the pages of the process that raised the data transfer might be swapped out, and so on. This works just fine for ordinary I/O data transfers because they involve pages of the disk caches . Disk caches are owned by the kernel, cannot be swapped out, and are visible to all processes in Kernel Mode.
On the other hand, direct I/O transfers should move data within pages that belong to the User Mode address space of a given process. The kernel must take care that these pages are accessible by every process in Kernel Mode and that they are not swapped out while the data transfer is in progress. Let us see how this is achieved.
When a self-caching application wishes to directly access a file,
it opens the file specifying the O_DIRECT flag (see the section "The open( ) System Call"
in Chapter 12). While
servicing the open( ) system call, the dentry_open(
) function checks whether the direct_IO method is implemented for the
address_space object of the file
being opened, and returns an error code in the opposite case. The
O_DIRECT flag can also be set for a
file already opened by using the F_SETFL command of the fcntl( ) system call.
Let us consider first the case where the self-caching application
issues a read( ) system call on a
file with O_DIRECT. As mentioned in
the section "Reading from a
File" earlier in this chapter, the read file method is usually implemented by the
generic_file_read( ) function, which
initializes the iovec and kiocb descriptors and invokes _ _generic_file_aio_read( ). The latter
function verifies that the User Mode buffer described by the iovec descriptor is valid, then checks whether
the O_DIRECT flag of the file is set.
When invoked by a read( ) system
call, the function executes a code fragment essentially equivalent to
the following:
if (filp->f_flags & O_DIRECT) {
if (count == 0 || *ppos > filp->f_mapping->host->i_size)
return 0;
retval = generic_file_direct_IO(READ, iocb, iov, *ppos, 1);
if (retval > 0)
*ppos += retval;
file_accessed(filp);
return retval;
}
The function checks the current values of the file pointer, the
file size, and the number of requested characters, and then invokes the
generic_file_direct_IO( ) function,
passing to it the READ operation
type, the iocb descriptor, the
iovec descriptor, the current value
of the file pointer, and the number of User Mode buffers specified in
the io_vec descriptor (one). When
generic_file_direct_IO( ) terminates,
_ _generic_file_aio_read( ) updates
the file pointer, sets the access timestamp on the file's inode, and
returns.
Something similar happens when a write(
) system call is issued on a file having the O_DIRECT flag set. As mentioned in the section
"Writing to a File"
earlier in this chapter, the write
method of the file ends up invoking generic_file_aio_write_nolock( ): this
function checks whether the O_DIRECT
flag is set and, if so, invokes the generic_file_direct_IO( ) function, this time
specifying the WRITE operation
type.
The generic_file_direct_IO( )
function acts on the following parameters:
rw
Type of operation: READ
or WRITE
iocb
Pointer to a kiocb
descriptor (see Table
16-1)
iov
Pointer to an array of iovec descriptors (see the section
"Reading from a
File" earlier in this chapter)
offset
File offset
nr_segs
Number of iovec
descriptors in the iov
array
The steps performed by generic_file_direct_IO( ) are the
following:
Gets the address file of
the file object from the ki_filp
field of the kiocb descriptor,
and the address mapping of the
address_space object from the
file->f_mapping field.
If the type of operation is WRITE and if one or more processes have
created a memory mapping associated with a portion of the file, it
invokes unmap_mapping_range( ) to
unmap all pages of the file. This function also ensures that if any
Page Table entry corresponding to a page to be unmapped has the
Dirty bit set, then the
corresponding page is marked as dirty in the page cache.
If the radix tree rooted at mapping is not empty (mapping->nrpages greater than zero), it
invokes the filemap_fdatawrite( )
and filemap_fdatawait( )
functions to flush all dirty pages to disk and to wait until the I/O
operations complete (see the section "Flushing Dirty Memory Mapping
Pages to Disk" earlier in this chapter). (Even if the
self-caching application is accessing the file directly, there could
be other applications in the system that access the file through the
page cache. To avoid data loss, the disk image is synchronized with
the page cache before starting the direct I/O transfer.)
Invokes the direct_IO
method of the mapping address
space (see the following paragraphs).
If the operation type was WRITE, it invokes invalidate_inode_pages2( ) to scan all
pages in the radix tree of mapping and to release them. The function
also clears the User Mode Page Table entries that refer to those
pages.
In most cases, the direct_IO
method is a wrapper for the _
_blockdev_direct_IO( ) function. This function is quite
complex and invokes a large number of auxiliary data structures and
functions; however, it executes essentially the same kind of operations
already described in this chapter: it splits the data to be read or
written in suitable blocks, locates the data on disk, and fills up one
or more bio descriptors that describe the I/O operations to be
performed. Of course, the data will be read or written directly in the
User Mode buffers specified by the iovec descriptors in the iov array. The bio descriptors are submitted
to the generic block layer by invoking the submit_bio( ) function (see the section "Submitting Buffer Heads to the
Generic Block Layer" in Chapter 15). Usually, the _ _blockdev_direct_IO( ) function does not
return until all direct I/O transfers have been completed; thus, once
the read( ) or write( ) system call returns, the self-caching
application can safely access the buffers containing the file
data.
The POSIX 1003.1 standard defines a set of library functions—listed in Table 16-4—for accessing the files in an asynchronous way. "Asynchronous" essentially means that when a User Mode process invokes a library function to read or write a file, the function terminates as soon as the read or write operation has been enqueued, possibly even before the actual I/O data transfer takes place. The calling process can thus continue its execution while the data is being transferred.
Table 16-4. The POSIX library functions for asynchronous I/O
Function | Description |
|---|---|
aio_read( ) | Asynchronously reads some data from a file |
aio_write( ) | Asynchronously writes some data into a file |
aio_fsync( ) | Requests a flush operation for all outstanding asynchronous I/O operations (does not block) |
aio_error( ) | Gets the error code for an outstanding asynchronous I/O operation |
aio_return( ) | Gets the return code for a completed asynchronous I/O operation |
aio_cancel( ) | Cancels an outstanding asynchronous I/O operation |
aio_suspend( ) | Suspends the process until at least one of several outstanding I/O operations completes |
Using asynchronous I/O is quite simple. The application opens the
file by means of the usual open( )
system call. Then, it fills up a control block of type
struct aiocb with the information
describing the requested operation. The most commonly used fields of the
struct aiocb control block
are:
aio_fildes
The file descriptor of the file (as returned by the open( ) system call)
aio_buf
The User Mode buffer for the file's data
aio_nbytes
How many bytes should be transferred
aio_offset
Position in the file where the read or write operation will start (it is independent of the "synchronous" file pointer)
Finally, the application passes the address of the control block
to either aio_read( ) or aio_write( )
; both functions terminate as soon as the requested I/O
data transfer has been enqueued by the system library or kernel. The
application can later check the status of the outstanding I/O operation
by invoking aio_error( ), which
returns EINPROGRESS if the data
transfer is still in progress, 0 if it is successfully completed, or an
error code in case of failure. The aio_return(
) function returns the number of bytes effectively read or
written by a completed asynchronous I/O operation, or -1 in case of failure.
Asynchronous I/O can be implemented by a system library
without any kernel support at all. Essentially, the aio_read( ) or aio_write( ) library function clones the
current process and lets the child invoke the synchronous read( ) or write( )
system calls; then, the parent terminates the aio_read( ) or aio_write( ) function and continues the
execution of the program, hence it does not wait for the synchronous
operation started by the child to finish. However, this "poor man's"
version of the POSIX functions is significantly slower than a version
that uses a kernel-level implementation of asynchronous I/O.
The Linux 2.6 kernel version sports a set of system calls for
asynchronous I/O. However, in Linux 2.6.11 this feature is a work in
progress, and asynchronous I/O works properly only for files opened
with the O_DIRECT flag set (see the
previous section). The system calls for asynchronous I/O are listed in
Table 16-5.
Table 16-5. Linux system calls for asynchronous I/O
System call | Description |
|---|---|
io_setup( ) | Initializes an asynchronous context for the current process |
io_submit( ) | Submits one or more asynchronous I/O operations |
io_getevents( ) | Gets the completion status of some outstanding asynchronous I/O operations |
io_cancel( ) | Cancels an outstanding I/O operation |
io_destroy( ) | Removes an asynchronous context for the current process |
If a User Mode process wants to make use of the
io_submit( ) system call to start
an asynchronous I/O operation, it must create beforehand an
asynchronous I/O context.
Basically, an asynchronous I/O context (in short, AIO context)
is a set of data structures that keep track of the on-going
progresses of the asynchronous I/O operations requested by the
process. Each AIO context is associated with a kioctx object, which stores all
information relevant for the context. An application might create
several AIO contexts; all kioctx
descriptors of a given process are collected in a singly linked list
rooted at the ioctx_list field of
the memory descriptor (see Table 9-2 in Chapter 9).
We are not going to discuss in detail the kioctx object; however, we should pinpoint
an important data structure referenced by the kioctx object: the AIO ring.
The AIO ring is a memory buffer in the
address space of the User Mode process that is also accessible by
all processes in Kernel Mode. The User Mode starting address and
length of the AIO ring are stored in the ring_info.mmap_base and ring_info.mmap_size fields of the kioctx object, respectively. The
descriptors of all page frames composing the AIO ring are stored in
an array pointed to by the ring_info.ring_pages field.
The AIO ring is essentially a circular buffer where the kernel
writes the completion reports of the outstanding asynchronous I/O
operations. The first bytes of the AIO ring contain a header (a
struct aio_ring data structure);
the remaining bytes store io_event data structures, each of which
describes a completed asynchronous I/O operation. Because the pages
of the AIO ring are mapped in the User Mode address space of the
process, the application can check directly the progress of the
outstanding asynchronous I/O operations, thus avoiding using a
relatively slow system call.
The io_setup( ) system call
creates a new AIO context for the calling process. It expects two
parameters: the maximum number of outstanding asynchronous I/O
operations, which ultimately determines the size of the AIO ring,
and a pointer to a variable that will store a handle to the context;
this handle is also the base address of the AIO ring. The sys_io_setup( ) service routine
essentially invokes do_mmap( ) to
allocate a new anonymous memory region for the process that will
contain the AIO ring (see the section "Allocating a Linear Address
Interval" in Chapter
9), and creates and initializes a kioctx object describing the AIO
context.
Conversely, the io_destroy(
) system call removes an AIO context; it also destroys the
anonymous memory region containing the corresponding AIO ring. The
system call blocks the current process until all outstanding
asynchronous I/O operations are complete.
To start some asynchronous I/O operations, the
application invokes the io_submit(
) system call. The system call has three
parameters:
ctx_id
The handle returned by io_setup( ), which identifies the
AIO context
iocbpp
The address of an array of pointers to descriptors of
type iocb, each of which
describes one asynchronous I/O operation
nr
The length of the array pointed to by iocbpp
The iocb data structure
includes the same fields as the POSIX aiocb descriptor (aio_fildes, aio_buf, aio_nbytes, aio_offset) plus the aio_lio_opcode field that stores the type
of the requested operation (typically read, write, or sync).
The service routine sys_io_submit(
) performs essentially the following steps:
Verifies that the array of iocb descriptors is valid.
Searches the kioctx
object corresponding to the ctx_id handle in the list rooted at
the ioctx_list field of the
memory descriptor.
For each iocb
descriptor in the array, it executes the following
substeps:
Gets the address of the file object corresponding to
the file descriptor stored in the aio_fildes field.
Allocates and initializes a new kiocb descriptor for the I/O
operation.
Checks that there is a free slot in the AIO ring to store the completion result of the operation.
Sets the ki_retry
method of the kiocb
descriptor according to the type of the operation (see
below).
Executes the aio_run_iocb(
) function, which essentially invokes the ki_retry method to start the I/O
data transfer for the corresponding asynchronous I/O
operation. If the ki_retry method returns the value
-EIOCBRETRY, the
asynchronous I/O operation has been submitted but not yet
fully satisfied: the aio_run_iocb(
) function will be invoked again on this kiocb at a later time (see below).
Otherwise, it invokes aio_complete(
) to add a completion event for the asynchronous
I/O operation in the ring of the AIO context.
If the asynchronous I/O operation is a read request, the
ki_retry method of the
corresponding kiocb descriptor is
implemented by aio_pread( ). This
function essentially executes the aio_read method of the file object, then
updates the ki_buf and ki_left fields of the kiocb descriptor (see Table 16-1 earlier in
this chapter) according to the value returned by the aio_read method. Finally, aio_pread( ) returns the number of bytes
effectively read from the file, or the value -EIOCBRETRY if the function determines
that not all requested bytes have been transferred. For most
filesystems, the aio_read method
of the file object ends up invoking the _
_generic_file_aio_read( ) function. Assuming that the
O_DIRECT flag of the file is set,
this function ends up invoking the generic_file_direct_IO( ) function, as
described in the previous section. In this case, however, the
_ _blockdev_direct_IO( ) function
does not block the current process waiting for the I/O data transfer
to complete; instead, the function returns immediately. Because the
asynchronous I/O operation is still outstanding, the aio_run_iocb( ) function will be invoked again,
this time by the aio kernel thread of the aio_wq work queue. The kiocb descriptor keeps track of the
progress of the I/O data transfer; eventually all requested data
will be transferred and the completion result will be added to the
AIO ring.
Similarly, if the asynchronous I/O operation is a write
request, the ki_retry method of
the kiocb descriptor is
implemented by aio_pwrite( ).
This function essentially executes the aio_write method of the file object, then
updates the ki_buf and ki_left fields of the kiocb descriptor (see Table 16-1 earlier in
this chapter) according to the value returned by the aio_write method. Finally, aio_pwrite( ) returns the number of bytes
effectively written to the file, or the value -EIOCBRETRY if the function determines
that not all requested bytes have been transferred. For most
filesystems, the aio_write method
of the file object ends up invoking the generic_file_aio_write_nolock( ) function.
Assuming that the O_DIRECT flag
of the file is set, this function ends up invoking the generic_file_direct_IO( ) function, as
above.
In previous chapters, we explained how the kernel handles dynamic memory by keeping track of free and busy page frames. We have also discussed how every process in User Mode has its own address space and has its requests for memory satisfied by the kernel one page at a time, so that page frames can be assigned to the process at the very last possible moment. Last but not least, we have shown how the kernel makes use of dynamic memory to implement both memory and disk caches .
In this chapter, we complete our description of the virtual memory subsystem by discussing page frame reclaiming. We'll start in the first section, "The Page Frame Reclaiming Algorithm," explaining why the kernel needs to reclaim page frames and what strategy it uses to achieve this. We then make a technical digression in the section "Reverse Mapping" to discuss the data structures used by the kernel to locate quickly all the Page Table entries that point to the same page frame. The section "Implementing the PFRA" is devoted to the page frame reclaiming algorithm used by Linux. The last main section, "Swapping," is almost a chapter by itself: it covers the swap subsystem, a kernel component used to save anonymous (not mapping data of files) pages on disk.
One of the fascinating aspects of Linux is that the checks performed before allocating dynamic memory to User Mode processes or to the kernel are somewhat perfunctory.
No rigorous check is made, for instance, on the total amount of RAM assigned to the processes created by a single user (the limits mentioned in the section "Process Resource Limits" in Chapter 3 mostly affect single processes). Similarly, no limit is placed on the size of the many disk caches and memory caches used by the kernel.
This lack of controls is a design choice that allows the kernel to use the available RAM in the best possible way. When the system load is low, the RAM is filled mostly by the disk caches and the few running processes can benefit from the information stored in them. However, when the system load increases, the RAM is filled mostly by pages of the processes and the caches are shrunken to make room for additional processes.
As we saw in previous chapters, both memory and disk caches grab more and more page frames but never release any of them. This is reasonable because cache systems don't know if and when processes will reuse some of the cached data and are therefore unable to identify the portions of cache that should be released. Moreover, thanks to the demand paging mechanism described in Chapter 9, User Mode processes get page frames as long as they proceed with their execution; however, demand paging has no way to force processes to release the page frames whenever they are no longer used.
Thus, sooner or later all the free memory will be assigned to processes and caches. The page frame reclaiming algorithm of the Linux kernel refills the lists of free blocks of the buddy system by "stealing" page frames from both User Mode processes and kernel caches.
Actually, page frame reclaiming must be performed before all the free memory has been used up. Otherwise, the kernel might be easily trapped in a deadly chain of memory requests that leads to a system crash. Essentially, to free a page frame the kernel must write its data to disk; however, to accomplish this operation, the kernel requires another page frame (for instance, to allocate the buffer heads for the I/O data transfer). If no free page frame exists, no page frame can be freed.
One of the goals of page frame reclaiming is thus to conserve a minimal pool of free page frames so that the kernel may safely recover from "low on memory" conditions.
The objective of the page frame reclaiming algorithm
(PFRA ) is to pick up page frames and make them free. Clearly
the page frames selected by the PFRA must be
non-free , that is, they must not be already included in one of
the free_area arrays used by the
buddy system (see the section "The Buddy System
Algorithm" in Chapter
8).
The PFRA handles the page frames in different ways, according to their contents. We can distinguish between unreclaimable pages, swappable pages, syncable pages, and discardable pages. These types are explained in Table 17-1.
Table 17-1. The types of pages considered by the PFRA
In the above table, a page is said to be mapped if it maps a portion of a file. For instance, all pages in the User Mode address spaces belonging to file memory mappings are mapped, as well as any other page included in the page cache. In almost all cases, mapped pages are syncable: in order to reclaim the page frame, the kernel must check whether the page is dirty and, if necessary, write the page contents in the corresponding disk file.
Conversely, a page is said to be anonymous if it belongs to an anonymous memory region of a process (for instance, all pages in the User Mode heap or stack of a process are anonymous). In order to reclaim the page frame, the kernel must save the page contents in a dedicated disk partition or disk file called "swap area" (see the later section "Swapping"); therefore, all anonymous pages are swappable.
Usually, the pages of special filesystems are not reclaimable. The only exceptions are the pages of the tmpfs special filesystem, which can be reclaimed by saving them in a swap area. As we'll see in Chapter 19, the tmpfs special filesystem is used by the IPC shared memory mechanism.
When the PFRA must reclaim a page frame belonging to the User Mode address space of a process, it must take into consideration whether the page frame is shared or non-shared . A shared page frame belongs to multiple User Mode address spaces, while a non-shared page frame belongs to just one. Notice that a non-shared page frame might belong to several lightweight processes referring to the same memory descriptor.
Shared page frames are typically created when a process spawns a child; as explained in the section "Copy On Write" in Chapter 9, the page tables of the child are copied from those of the parent, thus parent and child share the same page frames. Another common case occurs when two or more processes access the same file by means of a shared memory mapping (see the section "Memory Mapping" in Chapter 16).[*]
While it is easy to identify the page candidates for memory reclaiming—roughly speaking, any page belonging to a disk or memory cache, or to the User Mode address space of a process—selecting the proper target pages is perhaps the most sensitive issue in kernel design.
As a matter of fact, the hardest job of a developer working on the virtual memory subsystem consists of finding an algorithm that ensures acceptable performance both for desktop machines (on which memory requests are quite limited but system responsiveness is crucial) and for high-level machines such as large database servers (on which memory requests tend to be huge).
Unfortunately, finding a good page frame reclaiming algorithm is a rather empirical job, with very little support from theory. The situation is somewhat similar to evaluating the factors that determine the dynamic priority of a process: the main objective is to tune the parameters in such a way to achieve good system performance, without asking too many questions about why it works well. Often, it's just a matter of "let's try this approach and see what happens." An unpleasant side effect of this empirical design is that the code changes quickly. For that reason, we cannot ensure that the memory reclaiming algorithm we are going to describe—the one used in Linux 2.6.11—will be exactly the same, by the time you'll read this chapter, as the one adopted by the most up-to-date version of the Linux 2.6 kernel. However, the general ideas and the main heuristic rules described here should continue to hold.
Looking too close to the trees' leaves might lead us to miss the whole forest. Therefore, let us present a few general rules adopted by the PFRA. These rules are embedded in the functions that will be described later in this chapter.
Pages included in disk and memory caches not referenced by any process should be reclaimed before pages belonging to the User Mode address spaces of the processes; in the former case, in fact, the page frame reclaiming can be done without modifying any Page Table entry. As we will see in the section "The Least Recently Used (LRU) Lists" later in this chapter, this rule is somewhat mitigated by introducing a "swap tendency factor."
With the exception of locked pages, the PFRA must be able to steal any page of a User Mode process, including the anonymous pages. In this way, processes that have been sleeping for a long period of time will progressively lose all their page frames.
When the PFRA wants to free a page frame shared by several processes, it clears all page table entries that refer to the shared page frame, and then reclaims the page frame.
The PFRA uses a simplified Least Recently Used (LRU) replacement algorithm to classify pages as in-use and unused.[*] If a page has not been accessed for a long time, the probability that it will be accessed in the near future is low and it can be considered "unused;" on the other hand, if a page has been accessed recently, the probability that it will continue to be accessed is high and it must be considered as "in-use." The PFRA reclaims only unused pages. This is just another application of the locality principle mentioned in the section "Hardware Cache" in Chapter 2.
The main idea behind the LRU algorithm is to associate a
counter storing the age of the page with each page in RAM—that
is, the interval of time elapsed since the last access to the
page. This counter allows the PFRA to reclaim only the oldest
page of any process. Some computer platforms provide
sophisticated support for LRU algorithms;[†] unfortunately, 80 × 86 processors do not offer
such a hardware feature, thus the Linux kernel cannot rely on a
page counter that keeps track of the age of every page. To cope
with this restriction, Linux takes advantage of the Accessed bit included in each Page
Table entry, which is automatically set by the hardware when the
page is accessed; moreover, the age of a page is represented by
the position of the page descriptor in one of two different
lists (see the section "The Least Recently Used
(LRU) Lists" later in this chapter).
Therefore, the page frame reclaiming algorithm is a blend of several heuristics:
Careful selection of the order in which caches are examined.
Ordering of pages based on aging (least recently used pages should be freed before pages accessed recently).
Distinction of pages based on the page state (for example, non-dirty pages are better candidates than dirty pages because they don't have to be written to disk).
[*] It should be noted, however, that when a single process accesses a file through a shared memory mapping, the corresponding pages are non-shared as far as the PFRA is concerned. Similarly, a page belonging to a private memory mapping may be treated as shared by the PFRA (for instance, because two processes read the same file portion and none of them modified the data in the page).
As stated in the previous section, one of the objectives of the PFRA is to be able to free a shared page frame. To that end, the Linux 2.6 kernel is able to locate quickly all the Page Table entries that point to the same page frame. This activity is called reverse mapping .
A trivial solution for reverse mapping would be to include in each page descriptor additional fields to link together all the Page Table entries that point to the page frame associated with the page descriptor. However, keeping such lists up-to-date would increase significantly the kernel overhead; for that reason, more sophisticated solutions have been devised. The technique used in Linux 2.6 is named object-based reverse mapping. Essentially, for any reclaimable User Mode page, the kernel stores the backward links to all memory regions in the system (the "objects") that include the page itself. Each memory region descriptor stores a pointer to a memory descriptor, which in turn includes a pointer to a Page Global Directory. Therefore, the backward links enable the PFRA to retrieve all Page Table entries referencing a given page. Because there are fewer memory region descriptors than page descriptors, updating the backward links of a shared page is less time consuming. Let's see how this scheme is worked out.
First of all, the PFRA must have a way to determine whether the
page to be reclaimed is shared or non-shared, and whether it is mapped
or anonymous. In order to do this, the kernel looks at two fields of the
page descriptor: _mapcount and
mapping.
The _mapcount field stores the
number of Page Table entries that refer to the page frame. The counter
starts from -1: this value means that
no Page Table entry references the page frame. Thus, if the counter is
zero, the page is non-shared, while if it is greater than zero the page
is shared. The page_mapcount( )
function receives the address of a page descriptor and returns the value
of its _mapcount plus one (thus, for
instance, it returns one for a non-shared page included in the User Mode
address space of some process).
The mapping field of the page
descriptor determines whether the page is mapped or anonymous, as
follows:
If the mapping field is
NULL, the page belongs to the
swap cache (see the section "The Swap Cache" later
in this chapter).
If the mapping field is not NULL and its least significant bit is 1,
it means the page is anonymous and the mapping field encodes the pointer to an
anon_vma descriptor (see the next
section, "Reverse
Mapping for Anonymous Pages").
If the mapping field is
non-NULL and its least
significant bit is 0, the page is mapped; the mapping field points to the address_space object of the corresponding
file (see the section "The address_space
Object" in Chapter
15).
Every address_space object used
by Linux is aligned in RAM so that its starting linear address is a
multiple of four. Therefore, the least significant bit of the mapping field can be used as a flag denoting
whether the field contains a pointer to an address_space object or to an anon_vma descriptor. This is a dirty
programming trick, but the kernel uses a lot of page descriptors, thus
these data structures should be as small as possible. The PageAnon( ) function receives as its parameter
the address of a page descriptor and returns 1 if the least significant
bit of the mapping field is set, 0
otherwise.
The try_to_unmap( ) function,
which receives as its parameter a pointer to a page descriptor, tries to
clear all the Page Table entries that point to the page frame associated
with that page descriptor. The function returns SWAP_SUCCESS (zero) if the function succeeded
in removing any reference to the page frame from all Page Table entries,
it returns SWAP_AGAIN (one) if some
reference could not be removed, and returns SWAP_FAIL (two) in case of errors. The
function is quite short:
int try_to_unmap(struct page *page)
{
int ret;
if (PageAnon(page))
ret = try_to_unmap_anon(page);
else
ret = try_to_unmap_file(page);
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
}
The try_to_unmap_anon( ) and
try_to_unmap_file( ) functions take
care of anonymous pages and mapped pages, respectively. These functions
will be described in the forthcoming sections.
Anonymous pages are often shared among several
processes. The most common case occurs when forking a new process: as
explained in the section "Copy On Write" in Chapter 9, all page frames owned
by the parent—including the anonymous pages—are assigned also to the
child. Another (quite unusual) case occurs when a process creates a
memory region specifying both the MAP_ANONYMOUS and MAP_SHARED flags: the pages of such a region
will be shared among the future descendants of the process.
The strategy to link together all the anonymous pages that refer to the same page frame is simple: the anonymous memory regions that include the page frame are collected in a doubly linked circular list. Be warned that, even if an anonymous memory region includes different pages, there always is just one reverse mapping list for all the page frames in the region.
When the kernel assigns the first page frame to an anonymous
region, it creates a new anon_vma
data structure, which includes just two fields: lock, a spin lock for protecting the list
against race conditions, and head,
the head of the doubly linked circular list of memory region
descriptors. Then, the kernel inserts the vm_area_struct descriptor of the anonymous
memory region in the anon_vma's
list; to that end, the vm_area_struct data structure includes two
fields related to this list: anon_vma_node stores the pointers to the
next and previous elements in the list, while anon_vma points to the anon_vma data structure. Finally, the kernel
stores the address of the anon_vma
data structure in the mapping field
of the descriptor of the anonymous page, as described previously. See
Figure 17-1.
When a page frame already referenced by one process is inserted
into a Page Table entry of another process (for instance, as a
consequence of a fork( ) system call, see
the section "The
clone( ), fork( ), and vfork( ) System Calls" in Chapter 3); the kernel simply
inserts the anonymous memory region of the second process in the
doubly linked circular list of the anon_vma data structure pointed to by the
anon_vma field of the first
process's memory region. Therefore, any anon_vma's list typically includes memory
regions owned by different processes.[*]
As shown in Figure
17-1, the anon_vma's list
allows the kernel to quickly locate all Page Table entries that refer
to the same anonymous page frame. In fact, each region descriptor
stores in the vm_mm field the
address of the memory descriptor, which in turn includes a field
pgd containing the address of the
Page Global Directory of the process. The Page Table entry can then be
determined by considering the starting linear address of the anonymous
page, which is easily obtained from the memory region descriptor and
the index field of the page
descriptor.
When reclaiming an anonymous page frame, the PFRA must scan
all memory regions in the anon_vma's list and carefully check
whether each region actually includes an anonymous page whose
underlying page frame is the target page frame. This job is done by
the try_to_unmap_anon( )
function, which receives as its parameter the descriptor of the
target page frame and performs essentially the following
steps:
Acquires the lock spin
lock of the anon_vma data
structure pointed to by the mapping field of the page
descriptor.
Scans the anon_vma's
list of memory region descriptors; for each vma memory region descriptor found in
that list, it invokes the try_to_unmap_one( ) function passing
as parameters vma and the
page descriptor (see below). If for some reason this function
returns a SWAP_FAIL value, or
if the _mapcount field of the
page descriptor indicates that all Page Table entries
referencing the page frame have been found, the scanning
terminates before reaching the end of the list.
Releases the spin lock obtained in step 1.
Returns the value computed by the last invocation of
try_to_unmap_one( ): SWAP_AGAIN (partial success) or
SWAP_FAIL (failure).
The try_to_unmap_one(
) function is called repeatedly both from try_to_unmap_anon( ) and from try_to_unmap_file( ). It acts on two
parameters: a pointer page to a
target page descriptor and a pointer vma to a memory region descriptor. The
function essentially performs the following actions:
Computes the linear address of the page to be reclaimed
from the starting linear address of the memory region (vma->vm_start), the offset of the
memory region in the mapped file (vma->vm_pgoff), and the offset of
the page inside the mapped file (page->index). For anonymous pages,
the vma->vm_pgoff field is
either zero or equal to vm_start/PAGE_SIZE; correspondingly,
the page->index field is
either the index of the page inside the region or the linear
address of the page divided by PAGE_SIZE.
If the target page is anonymous, it checks whether its
linear address falls inside the memory region; if not, it
terminates by returning SWAP_AGAIN. (As explained when
introducing reverse mapping for anonymous pages, the anon_vma's list may include memory
regions that do not contain the target page.)
Gets the address of the memory descriptor from vma->vm_mm, and acquires the
vma->vm_mm->page_table_lock spin
lock that protects the page tables.
Invokes successively pgd_offset(
), pud_offset( ),
pmd_offset( ), and pte_offset_map( ) to get the address
of the Page Table entry that corresponds to the linear address
of the target page.
Performs a few checks to verify that the target page is
effectively reclaimable. If any of the following checks fails,
the function jumps to step 12 to terminate by returning a proper
error number, either SWAP_AGAIN or SWAP_FAIL:
Checks that the Page Table entry points to the target
page; if not, the function returns SWAP_AGAIN. This can happen in the
following cases:
The Page Table entry refers to a page frame
assigned with COW , but the anonymous memory region
identified by vma
still belongs to the anon_vma list of the original
page frame.
The mremap( )
system call may remap memory regions and
move the pages into the User Mode address space by
directly modifying the page table entries. In this
particular case, object-based reverse mapping does not
work, because the index field of the page
descriptor cannot be used to determine the actual linear
address of the page.
The file memory mapping is non-linear (see the section "Non-Linear Memory Mappings" in Chapter 16).
Checks that the memory region is not locked (VM_LOCKED) or reserved (VM_RESERVED); if one of these
restrictions is in place, the function returns SWAP_FAIL.
Checks that the Accessed bit inside the Page Table
entry is cleared; if not, the function clears the bit and
returns SWAP_FAIL. If the
Accessed bit is set, the
page is considered in-use, thus it should not be
reclaimed.
Checks whether the page belongs to the swap cache (see
the section "The Swap
Cache" later in this chapter) and it is currently
being handled by get_user_pages(
) (see the section "Allocating a Linear
Address Interval" in Chapter 9); in this
case, to avoid a nasty race condition, the function returns
SWAP_FAIL.
The page can be reclaimed: if the Dirty bit in the Page Table entry is
set, sets the PG_dirty flag
of the page.
Clears the Page Table entry and flushes the corresponding TLBs.
If the page is anonymous, the function inserts a
swapped-out page identifier in the Page Table entry so that
further accesses to this page will swap in the page (see the
section "Swapping" later in
this chapter). Moreover, it decreases the counter of anonymous
pages stored in the anon_rss
field of the memory descriptor.
Decreases the counter of page frames allocated to the
process stored in the rss
field of the memory descriptor.
Decreases the _mapcount
field of the page descriptor, because a reference to this page
frame in the User Mode Page Table entries has been
deleted.
Decreases the usage counter of the page frame, which is
stored in the _count field of
the page descriptor. If the counter becomes negative, it removes
the page descriptor from the active or inactive list (see the
section "The Least
Recently Used (LRU) Lists" later in this chapter), and
invokes free_hot_page( ) to
release the page frame (see the section "The Per-CPU Page Frame
Cache" in Chapter
8).
Invokes pte_unmap( ) to
release the temporary kernel mapping that could have been
allocated by pte_offset_map(
) in step 4 (see the section "Kernel Mappings of
High-Memory Page Frames" in Chapter 8).
Releases the vma->vm_mm->page_table_lock spin
lock acquired in step 3.
Returns the proper error code (SWAP_AGAIN in case of success).
As with anonymous pages, object-based reverse mapping for mapped pages is based on a simple idea: it is always possible to retrieve the Page Table entries that refer to a given page frame by accessing the descriptors of the memory regions that include the corresponding mapped pages. Thus, the core of reverse mapping is a clever data structure that collects all memory region descriptors relative to a given page frame.
We have seen in the previous section that descriptors for anonymous memory regions are collected in doubly linked circular lists; retrieving all page table entries referencing a given page frame involves a linear scanning of the elements in the list. The number of shared anonymous page frames is never very large, hence this approach works well.
Contrary to anonymous pages, mapped pages are frequently shared, because many different processes may share the same pages of code. For instance, consider that nearly all processes in the system share the pages containing the code of the standard C library (see the section "Libraries" in Chapter 20). For this reason, Linux 2.6 relies on special search trees, called "priority search trees ," to quickly locate all the memory regions that refer to the same page frame.
There is a priority search tree for every file; its root is
stored in the i_mmap field of the
address_space object embedded in
the file's inode object. It is
always possible to quickly retrieve the root of the search tree,
because the mapping field in the
descriptor of a mapped page points to the address_space object.
The priority search tree (PST) used by Linux 2.6 is based on a data structure introduced by Edward McCreight in 1985 to represent a set of overlapping intervals. McCreight's tree is a hybrid of a heap and a balanced search tree, and it is used to perform queries on the set of intervals—e.g., "what intervals are contained in a given interval?" and "what intervals intersect a given interval?"—in an amount of time directly proportional to the height of the tree and the number of intervals in the answer.
Each interval in a PST corresponds to a node of the tree, and it is characterized by two indices: the radix index, which corresponds to the starting point of the interval, and the heap index, which corresponds to the final point. The PST is essentially a search tree on the radix index, with the additional heap-like property that the heap index of a node is never smaller than the heap indices of its children.
The Linux priority search tree differs from McCreight's data structure in two important aspects: first, the Linux tree is not always kept balanced (the balancing algorithm is costly both in memory space and in execution time); second, the Linux tree is adapted so as to store memory regions instead of linear intervals.
Each memory region can be considered as an interval of file pages identified by the initial position in the file (the radix index) and the final position (the heap index). However, memory regions tend to start from the same pages (typically, from page index 0). Unfortunately, McCreight's original data structure cannot store intervals having the very same starting point. As a partial solution, each node of a PST carries an additional size index—other than the radix and heap indices—corresponding to the size of the memory region in pages minus one. The size index allows the search program to distinguish different memory regions that start at the same file position.
The size index, however, increases significantly the number of different nodes that may end up in a PST. In particular, if there are too many nodes having the same radix index but different heap indices, the PST could not contain all of them. To solve this problem, the PST may include overflow subtrees rooted at the leaves of the PST and containing nodes having a common radix index.
Furthermore, different processes may own memory regions that map exactly the same portion of the same file (just consider the example of the standard C library mentioned above). In that case, all nodes corresponding to these memory regions have the same radix, heap, and size indices . When the kernel must insert in a PST a memory region having the same indices as the ones of a node already existing, it inserts the memory region descriptor in a doubly linked circular list rooted at the older PST node.
Figure 17-2 shows a simple example of priority search tree. In the left side of the figure, we show seven memory regions covering the first six pages of a file; each interval is labeled with the radix index, size index, and heap index. In the right side of the figure, we draw the corresponding PST. Notice that no child node has a heap index greater than the heap index of the parent. Also observe that the radix index of the left child of any node is never greater than the radix index of the right child; in case of tie between the radix indices, the ordering is given by the size index. Let us suppose that the PFRA must retrieve all memory regions that include the page at index five. The search algorithm starts at the root (0,5,5): because the corresponding interval includes the page, this is the first retrieved memory region. Then the algorithm visits the left child (0,4,4) of the root and compares the heap index (four) with the page index: because the heap index is smaller, the interval does not include the page; moreover, thanks to the heap-like property of the PST, none of the children of this node can include the page. Thus the algorithm directly jumps to the right child (2,3,5) of the root. The corresponding interval includes the page, hence it is retrieved. Then the algorithm visits the children (1,2,3) and (2,0,2), but it discovers that neither of them include the page.
We won't be able, for lack of space, to describe in detail the
data structures and the functions that implement the Linux PSTs.
We'll only mention that a node of a PST is represented by a prio_tree_node data structure, which is
embedded in the shared.prio_tree_node field of each memory
region descriptor. The shared.vm_set data structure is used—as an
alternative to shared.prio_tree_node—to insert the memory
region descriptor in a duplicate list of a PST node. PST nodes can
be inserted and removed by executing the vma_prio_tree_insert( ) and vma_prio_tree_remove( ) functions; both of
them receive as their parameters the address of a memory region
descriptor and the address of a PST root. Queries on the PST can be
performed by executing the vma_prio_tree_foreach macro, which
implements a loop over all memory region descriptors that include
at least one page in a specified range of linear addresses.
The try_to_unmap_file(
) function is invoked by try_to_unmap( ) to perform the reverse
mapping of mapped pages. This function is quite simple to describe
when the memory mapping is linear (see the section "Memory Mapping" in Chapter 16). In this case, it
performs the following actions:
Gets the page->mapping->i_mmap_lock spin
lock.
Applies the vma_prio_tree_foreach( ) macro to the
priority search tree whose root is stored in the page->mapping->i_mmap field. For
each vm_area_struct
descriptor found by the macro, the function invokes try_to_unmap_one( ) to try to clear
the Page Table entry of the memory region that contains the page
(see the earlier section "Reverse Mapping for
Anonymous Pages"). If for some reason this function
returns a SWAP_FAIL value, or
if the _mapcount field of the
page descriptor indicates that all Page Table entries
referencing the page frame have been found, the scanning
terminates immediately.
Releases the page->mapping->i_mmap_lock spin
lock.
Returns either SWAP_AGAIN or SWAP_FAIL according to whether all
page table entries have been cleared.
If the mapping is non-linear (see the section "Non-Linear Memory
Mappings" in Chapter
16), the try_to_unmap_one(
) function may fail to clear some Page Table entries,
because the index field of the
page descriptor, which as usual stores the position of the page in
the file, is no longer related to the position of the page in the
memory region. Therefore, try_to_unmap_one(
) cannot determine the linear address of the page, hence
it cannot get the Page Table entry address.
The only solution is an exhaustive search in all the
non-linear memory regions of the file. The doubly linked list rooted
at the i_mmap_nonlinear field of
the page->mapping file's
address_space object includes the
descriptors of all non-linear memory regions of the file. For each
such memory region, try_to_unmap_file(
) invokes the try_to_unmap_cluster( ) function, which
scans all Page Table entries corresponding to the linear addresses
of the memory region and tries to clear them.
Because the search might be quite time-consuming, a limited
scan is performed and a heuristic rule determines the portion of the
memory region to be scanned: the vm_private_data field of the vma_area_struct descriptor holds the
current cursor in the current scan. This means that try_to_unmap_file( ) might in some cases
end up missing the page to be unmapped. When this occurs, try_to_unmap( ) discovers that the page is
still mapped and returns SWAP_AGAIN instead of SWAP_SUCCESS.
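The limited, cursor-based scan can be sketched as a toy model in C. This is not kernel code: the structure and function names below are invented for illustration, and the cluster size is arbitrary; only the idea of resuming each pass from a per-region cursor (which the kernel keeps in vm_private_data), and therefore possibly missing a page in a single pass, comes from the description above.

```c
#include <assert.h>

/* Invented model of a non-linear region with a persistent scan cursor. */
struct nonlinear_region {
    int nr_pages;           /* pages covered by the region */
    unsigned long cursor;   /* where the next limited scan resumes */
};

/* Scan at most `cluster` pages starting at the cursor; return 1 if the
 * page at index `target` was examined in this pass, 0 otherwise.  In the
 * kernel, "examined" would mean its Page Table entry gets cleared. */
int scan_cluster(struct nonlinear_region *r, int cluster, int target)
{
    int found = 0;
    for (int i = 0; i < cluster && i < r->nr_pages; i++) {
        int idx = (int)((r->cursor + i) % r->nr_pages);
        if (idx == target)
            found = 1;
    }
    /* Advance the cursor so the next invocation resumes further on. */
    r->cursor = (r->cursor + cluster) % r->nr_pages;
    return found;
}
```

A page deep inside the region is missed by the first few passes, which is exactly why try_to_unmap( ) may have to report SWAP_AGAIN and retry later.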
[*] An anon_vma's list may
also include several adjacent anonymous memory regions owned by
the same process. Usually this occurs when an anonymous memory
region is split in two or more regions by the mprotect( ) system call.
The page frame reclaiming algorithm must take care of many
kinds of pages owned by User Mode processes, disk caches and memory
caches; moreover, it has to obey several heuristic rules. Thus, it is
not surprising that the PFRA is composed of a large number of functions.
Figure 17-3 shows the
main PFRA functions; an arrow denotes a function invocation, thus for
instance try_to_free_pages( ) invokes
shrink_caches( ), shrink_slab( ), and out_of_memory( ).
As you can see, there are several "entry points" for the PFRA. Actually, page frame reclaiming is performed on essentially three occasions:
The kernel detects a "low on memory" condition.
The kernel must free memory because it is entering in the suspend-to-disk state (we don't further discuss this case).
A kernel thread is activated periodically to perform memory reclaiming, if necessary.
Low on memory reclaiming is activated in the following cases:
The grow_buffers( )
function, invoked by _ _getblk(
), fails to allocate a new buffer page (see the section
"Searching Blocks in
the Page Cache" in Chapter 15).
The alloc_page_buffers( )
function, invoked by create_empty_buffers(
), fails to allocate the temporary buffer heads for a page
(see the section "Reading
and Writing a File" in Chapter 16).
The _ _alloc_pages( )
function fails in allocating a group of contiguous page frames in a
given list of memory zones (see the section "The Buddy System
Algorithm" in Chapter
8).
Periodic reclaiming is activated by two different types of kernel threads:
The kswapd kernel threads, which check whether the number of
free page frames in some memory zone has fallen below the pages_high watermark (see the later
section "Periodic
Reclaiming").
The events kernel threads, which are the worker threads of the predefined work queue (see the section "Work Queues" in Chapter 4); the PFRA periodically schedules the execution of a task in the predefined work queue to reclaim all free slabs included in the memory caches handled by the slab allocator (see the section "The Slab Allocator" in Chapter 8).
We are now going to discuss in detail the various components of the page frame reclaiming algorithm, including all functions shown in Figure 17-3.
All pages belonging to the User Mode address space of processes or to the page cache are grouped into two lists called the active list and the inactive list ; they are also collectively denoted as LRU lists . The former list tends to include the pages that have been accessed recently, while the latter tends to include the pages that have not been accessed for some time. Clearly, pages should be stolen from the inactive list.
The active list and the inactive list of pages are the core data
structures of the page frame reclaiming algorithm. The heads of these
two doubly linked lists are stored, respectively, in the active_list and inactive_list fields of each zone descriptor (see the section "Memory Zones" in Chapter 8). The nr_active and nr_inactive fields in the same descriptor
store the number of pages in the two lists. Finally, the lru_lock field is a spin lock that protects
the two lists against concurrent accesses in SMP systems.
If a page belongs to an LRU list, its PG_lru flag in the page descriptor is set.
Moreover, if the page belongs to the active list, the PG_active flag is set, while if it belongs
to the inactive list, the PG_active
flag is cleared. The lru field of
the page descriptor stores the pointers to the next and previous
elements in the LRU list.
Several auxiliary functions are available to handle the LRU lists:
add_page_to_active_list( )
Adds the page to the head of the zone's active list and
increases the nr_active field
of the zone descriptor.
add_page_to_inactive_list( )
Adds the page to the head of the zone's inactive list and
increases the nr_inactive
field of the zone descriptor.
del_page_from_active_list( )
Removes the page from the zone's active list and decreases
the nr_active field of the
zone descriptor.
del_page_from_inactive_list( )
Removes the page from the zone's inactive list and
decreases the nr_inactive
field of the zone descriptor.
del_page_from_lru( )
Checks the PG_active
flag of a page; according to the result, removes the page from
the active or inactive list, decreases the nr_active or nr_inactive field of the zone
descriptor, and clears, if necessary, the PG_active flag.
activate_page( )
Checks the PG_active
flag; if it is clear (the page is in the inactive list), it
moves the page into the active list: invokes del_page_from_inactive_list( ), then
invokes add_page_to_active_list(
), and finally sets the PG_active flag. The zone's lru_lock spin lock is acquired before
moving the page.
lru_cache_add( )
If the page is not included in an LRU list, it sets the
PG_lru flag, acquires the
zone's lru_lock spin lock,
and invokes add_page_to_inactive_list(
) to insert the page in the zone's inactive
list.
lru_cache_add_active( )
If the page is not included in an LRU list, it sets the
PG_lru and PG_active flags, acquires the zone's
lru_lock spin lock, and
invokes add_page_to_active_list(
) to insert the page in the zone's active list.
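As a rough sketch, the list manipulations performed by these helpers can be modeled in C. The structures below are simplified stand-ins (not the kernel's struct page and struct zone), but the flag handling and counter updates follow the descriptions above; the lru_lock spin lock is omitted from this single-threaded model.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (names are invented). */
struct page_s {
    int pg_active;               /* models the PG_active flag */
    int pg_lru;                  /* models the PG_lru flag */
    struct page_s *next, *prev;  /* models the page->lru list links */
};

struct zone_s {
    struct page_s active_head;   /* head of the active list */
    struct page_s inactive_head; /* head of the inactive list */
    long nr_active;
    long nr_inactive;
};

static void list_init(struct page_s *head) { head->next = head->prev = head; }

static void list_add_head(struct page_s *head, struct page_s *p)
{
    p->next = head->next;
    p->prev = head;
    head->next->prev = p;
    head->next = p;
}

static void list_del(struct page_s *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
}

void zone_init(struct zone_s *z)
{
    list_init(&z->active_head);
    list_init(&z->inactive_head);
    z->nr_active = z->nr_inactive = 0;
}

void add_page_to_active_list(struct zone_s *z, struct page_s *p)
{
    list_add_head(&z->active_head, p);
    z->nr_active++;
}

void add_page_to_inactive_list(struct zone_s *z, struct page_s *p)
{
    list_add_head(&z->inactive_head, p);
    z->nr_inactive++;
}

void del_page_from_active_list(struct zone_s *z, struct page_s *p)
{
    list_del(p);
    z->nr_active--;
}

void del_page_from_inactive_list(struct zone_s *z, struct page_s *p)
{
    list_del(p);
    z->nr_inactive--;
}

/* del_page_from_lru( ): dispatch on PG_active, clearing it if needed. */
void del_page_from_lru(struct zone_s *z, struct page_s *p)
{
    if (p->pg_active) {
        del_page_from_active_list(z, p);
        p->pg_active = 0;
    } else {
        del_page_from_inactive_list(z, p);
    }
}

/* activate_page( ): promote an inactive page to the active list. */
void activate_page(struct zone_s *z, struct page_s *p)
{
    if (!p->pg_active) {
        del_page_from_inactive_list(z, p);
        add_page_to_active_list(z, p);
        p->pg_active = 1;
    }
}
```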
Actually, the last two functions, lru_cache_add( ) and lru_cache_add_active( ), are slightly more
complicated. In fact, the two functions do not immediately move the
page into an LRU; instead, they accumulate the pages in temporary data
structures of type pagevec, each of
which may contain up to 14 page descriptor pointers. The pages will be
effectively moved in an LRU list only when a pagevec structure is completely filled. This
mechanism enhances the system performance, because the LRU spin lock
is acquired only when the LRU lists are effectively modified.
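The batching idea can be sketched in a few lines of C. The counter and helper names below are invented for the model; only the 14-entry pagevec size and the flush-when-full behavior come from the text above.

```c
#include <assert.h>
#include <stddef.h>

#define PAGEVEC_SIZE 14       /* up to 14 page descriptor pointers */

struct pagevec {
    unsigned int nr;
    void *pages[PAGEVEC_SIZE];
};

/* Counts how many times the (simulated) LRU spin lock is taken. */
static unsigned long lock_acquisitions;

/* Flush: in the kernel this takes zone->lru_lock once and splices all
 * accumulated pages onto the LRU list in a single critical section. */
static void pagevec_flush(struct pagevec *pv)
{
    if (pv->nr == 0)
        return;
    lock_acquisitions++;      /* one lock round-trip for the whole batch */
    pv->nr = 0;               /* pages would be moved to the LRU here */
}

/* lru_cache_add( )-style helper: batch the page; flush when full. */
void lru_cache_add_batched(struct pagevec *pv, void *page)
{
    pv->pages[pv->nr++] = page;
    if (pv->nr == PAGEVEC_SIZE)
        pagevec_flush(pv);
}
```

Adding 30 pages takes the lock only twice (after the 14th and the 28th page) instead of 30 times, which is the performance benefit described above.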
The PFRA collects the pages that were recently accessed in the active list so that it will not scan them when looking for a page frame to reclaim. Conversely, the PFRA collects the pages that have not been accessed for a long time in the inactive list. Of course, pages should move from the inactive list to the active list and back, according to whether they are being accessed.
Clearly, two page states ("active" and "inactive") are not sufficient to describe all possible access patterns. For instance, suppose a logger process writes some data in a page once every hour. Although the page is "inactive" for most of the time, the access makes it "active," thus denying the reclaiming of the corresponding page frame, even if it is not going to be accessed for an entire hour. Of course, there is no general solution to this problem, because the PFRA has no way to predict the behavior of User Mode processes; however, it seems reasonable that pages should not change their status on every single access.
The PG_referenced flag in
the page descriptor is used to double the number of accesses
required to move a page from the inactive list to the active list;
it is also used to double the number of "missing accesses" required
to move a page from the active list to the inactive list (see
below). For instance, suppose that a page in the inactive list has
the PG_referenced flag set to 0.
The first page access sets the value of the flag to 1, but the page
remains in the inactive list. The second page access finds the flag
set and causes the page to be moved in the active list. If, however,
the second access does not occur within a given time interval after
the first one, the page frame reclaiming algorithm may reset the
PG_referenced flag.
As shown in Figure
17-4, the PFRA uses the mark_page_accessed( ), page_referenced( ), and refill_inactive_zone( ) functions to move
the pages across the LRU lists. In the figure, the LRU list
including the page is specified by the status of the PG_active flag.
Whenever the kernel must mark a page as accessed, it
invokes the mark_page_accessed( )
function. This happens every time the kernel determines that a page
is being referenced by a User Mode process, a filesystem layer, or a
device driver. For instance, mark_page_accessed( ) is invoked in the
following cases:
When loading on demand an anonymous page of a process
(performed by the do_anonymous_page(
) function; see the section "Demand Paging" in
Chapter 9).
When loading on demand a page of a memory mapped file
(performed by the filemap_nopage(
) function; see the section "Demand Paging for Memory
Mapping" in Chapter
16).
When loading on demand a page of an IPC shared memory
region (performed by the shmem_nopage(
) function; see the section "IPC Shared Memory"
in Chapter
19).
When reading a page of data from a file (performed by the
do_generic_file_read( )
function; see the section "Reading from a
File" in Chapter
16).
When swapping in a page (performed by the do_swap_page( ) function; see the
section "Swapping
in Pages" later in this chapter).
When looking up a buffer page in the page cache (see the
_ _find_get_block( ) function
in the section "Searching Blocks in the
Page Cache" in Chapter 15).
The mark_page_accessed( )
function executes the following code fragment:
if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
    activate_page(page);
    ClearPageReferenced(page);
} else if (!PageReferenced(page))
    SetPageReferenced(page);
As shown in Figure
17-4, the function moves the page from the inactive list to
the active list only if the PG_referenced flag is set before the
invocation.
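The two-access promotion rule implemented by the fragment above can be exercised with a small model in C. The struct and function names are invented stand-ins; the flag logic mirrors the quoted code.

```c
#include <assert.h>

/* Minimal model of a page's state flags (invented names). */
struct mpage {
    int active;      /* PG_active */
    int referenced;  /* PG_referenced */
    int on_lru;      /* PG_lru */
};

/* Mirror of the mark_page_accessed( ) fragment quoted above: an
 * inactive page needs two accesses before it reaches the active list. */
void mark_page_accessed_model(struct mpage *p)
{
    if (!p->active && p->referenced && p->on_lru) {
        p->active = 1;        /* activate_page( ) */
        p->referenced = 0;    /* ClearPageReferenced( ) */
    } else if (!p->referenced) {
        p->referenced = 1;    /* SetPageReferenced( ) */
    }
}
```

Running the model on a fresh inactive page shows that the first access merely sets PG_referenced, and only the second access moves the page to the active list.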
The page_referenced(
) function, which is invoked once for every page scanned
by the PFRA, returns 1 if either the PG_referenced flag or some of the Accessed bits in the Page Table entries
was set; it returns 0 otherwise. This function first checks the
PG_referenced flag of the page
descriptor; if the flag is set, it clears it. Next, it makes use of
the object-based reverse mapping mechanism to check and clear the
Accessed bits in all User Mode
Page Table entries that refer to the page frame. To do this, the
function makes use of three ancillary functions; page_referenced_anon( ), page_referenced_file( ), and page_referenced_one( ), which are
analogous to the try_to_unmap_xxx(
) functions described in the section "Reverse Mapping" earlier
in this chapter. The page_referenced(
) function also honors the swap token; see the section
"The Swap Token"
later in this chapter.
The page_referenced( )
function never moves a page from the active list to the inactive
list; this job is done by refill_inactive_zone( ). In practice, this
function does a lot more than move pages from the active to the
inactive list, so we are going to describe it in greater
detail.
As illustrated in Figure 17-3, the refill_inactive_zone( ) function is
invoked by shrink_zone( ), which
performs the reclaiming of pages in the page cache and in the User
Mode address spaces (see the section "Low On Memory
Reclaiming" later in this chapter). The function receives two
parameters: a pointer zone to a
memory zone descriptor, and a pointer sc to a scan_control structure. The latter data
structure is widely used by the PFRA and contains information about
the ongoing reclaiming operation; its fields are shown in Table 17-2.
Table 17-2. The fields of the scan_control descriptor

| Type | Field | Description |
|---|---|---|
| unsigned long | nr_to_scan | Target number of pages to be scanned in the active list. |
| unsigned long | nr_scanned | Number of inactive pages scanned in the current iteration. |
| unsigned long | nr_reclaimed | Number of pages reclaimed in the current iteration. |
| unsigned long | nr_mapped | Number of pages referenced in the User Mode address spaces. |
| int | nr_to_reclaim | Target number of pages to be reclaimed. |
| unsigned int | priority | Priority of the scanning, ranging between 12 and 0. Lower priority implies scanning more pages. |
| unsigned int | gfp_mask | GFP mask passed from calling function. |
| int | may_writepage | If set, writing a dirty page to disk is allowed (only for laptop mode). |
The role of refill_inactive_zone(
) is critical because moving a page from an active list to
an inactive list means making the page eligible to fall prey, sooner
or later, to the PFRA. If the function is too aggressive, it will
move a lot of pages from the active list to the inactive list; as a
consequence, the PFRA will reclaim a large number of page frames,
and the system performance will be hit. On the other hand, if the
function is too lazy, the inactive list will not be replenished with
a large enough number of unused pages, and the PFRA will fail in
reclaiming memory. Thus, the function implements an adaptive
behavior: it starts by scanning, at every invocation, a small number
of pages in the active list; however, if the PFRA is having trouble
in reclaiming page frames, refill_inactive_zone( ) keeps increasing
the number of active pages scanned at every invocation. This
behavior is controlled by the value of the priority field in the scan_control data structure (a lower value
means a more urgent priority).
Another heuristic rule regulates the behavior of the refill_inactive_zone( ) function. The LRU
lists include two kinds of pages: those belonging to the User Mode
address spaces, and those included in the page cache that do not
belong to any User Mode process. As stated earlier, the PFRA should
tend to shrink the page cache while leaving in RAM the pages owned
by the User Mode processes. However, no fixed "golden rule" may
yield good performance in every scenario, thus the refill_inactive_zone( ) function relies on
a swap tendency heuristic value: it determines whether the function
will move all kinds of pages, or just the pages that do not belong
to the User Mode address spaces.[*] The swap tendency value is computed by the function as
follows:
The mapped ratio value is the percentage
of pages in all memory zones that belong to User Mode address spaces
(sc->nr_mapped) with respect
to the total number of allocatable page frames. A high value of
mapped_ratio means that the
dynamic memory is mostly used by User Mode processes, while a low
value means that it is mostly used by the page cache.
The distress value is a measure of how
effectively the PFRA is reclaiming page frames in this zone; it is
based on the scanning priority of the zone in the previous run of
the PFRA, which is stored in the prev_priority field of the zone descriptor. The distress
value depends on the zone's previous priority as
follows:
| Zone prev. priority | 12...7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| Distress value | 0 | 1 | 3 | 6 | 12 | 25 | 50 | 100 |
Finally, the swappiness value is a user-defined constant, which is usually
set to 60. The system administrator may tune this value by writing
in the /proc/sys/vm/swappiness
file or by issuing the proper sysctl(
) system call.
Pages will be reclaimed from the address spaces of processes only if the zone's swap tendency is greater than or equal to 100. Thus, if the system administrator sets swappiness to 0, then the PFRA never reclaims pages in the User Mode address spaces unless the zone's previous priority is zero (an unlikely event); if the administrator sets swappiness to 100, then the PFRA reclaims pages in the User Mode address spaces at every invocation.
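The heuristic can be sketched in C. The text above does not quote the exact formula, so the combination below (mapped_ratio / 2 + distress + swappiness, with distress computed as 100 >> prev_priority, which reproduces the table above) is an assumption modeled on the Linux 2.6 refill_inactive_zone( ) code, not a quotation of it.

```c
#include <assert.h>

/* Distress value as a function of the zone's previous scanning priority.
 * The right shift reproduces the table: 12..7 -> 0, 6 -> 1, 5 -> 3,
 * 4 -> 6, 3 -> 12, 2 -> 25, 1 -> 50, 0 -> 100. */
int distress(int prev_priority)
{
    return 100 >> prev_priority;
}

/* Returns 1 when the swap tendency reaches 100, i.e., when pages of
 * User Mode address spaces become eligible for reclaiming.
 * (Formula assumed from the Linux 2.6 sources, as noted above.) */
int reclaim_mapped_pages(int mapped_ratio, int prev_priority, int swappiness)
{
    int swap_tendency = mapped_ratio / 2
                      + distress(prev_priority)
                      + swappiness;
    return swap_tendency >= 100;
}
```

With swappiness set to 0 the threshold is reached only when the previous priority is 0 (distress 100), matching the "unlikely event" remark above; with swappiness set to 100 the threshold is always reached.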
Here is a succinct description of what the refill_inactive_zone( ) function
does:
Invokes lru_add_drain(
) to move into the active and inactive lists any page
still contained in the pagevec data structures.
Gets the zone->lru_lock spin lock.
Performs a first cycle scanning the pages in zone->active_list, starting from
the tail of the list and moving backwards. Continues until the
list is empty or until sc->nr_to_scan pages have been
scanned. For each page scanned in this cycle, the function
increases its reference counter by one, removes the page
descriptor from zone->active_list, and puts it in a
temporary l_hold local list.
However, if the reference counter of the page frame was zero, it
puts back the page in the active list. In fact, page frames
having a reference counter equal to zero should belong to the
zone's Buddy system; however, to free a page frame, first its
usage counter is decreased and then the page frame is removed
from the LRU lists and inserted in the buddy system's list.
Therefore, there is a small time window in which the PFRA may
see a free page in an LRU list.
Adds to zone->pages_scanned the number of
active pages that have been scanned.
Subtracts from zone->nr_active the number of pages
that have been moved into the l_hold local list.
Releases the zone->lru_lock spin lock.
Computes the swap tendency value (see above).
Performs a second cycle on the pages in the l_hold local list. The objective of
this cycle is to split the pages of l_hold into two local sublists called
l_active and l_inactive. A page belonging to the
User Mode address space of some process—that is, a page whose
page->_mapcount is
nonnegative—is added to l_active if the swap tendency value is
smaller than 100, or if the page is anonymous but no swap area
is active, or finally if the page_referenced( ) function applied to
the page returns a positive value, which means that the page has
been recently accessed. In all other cases, the page is added to
the l_inactive list.[*]
Gets the zone->lru_lock spin lock.
Performs a third cycle on the pages in the l_inactive local list to move them in
the zone->inactive_list
list and updates the zone->nr_inactive field. In doing
so, it decreases the usage counters of the moved page frames to
undo the increments done in step 3.
Performs a fourth and last cycle on the pages in the
l_active local list to move
them into the zone->active_list list and updates
the zone->nr_active field.
In doing so, it decreases the usage counters of the moved page
frames to undo the increments done in step 3.
Releases the zone->lru_lock spin lock and
returns.
It should be noted that refill_inactive_zone( ) checks the
PG_referenced flag only for pages
that belong to the User Mode address spaces (see step 8); in the
opposite case, the pages are in the tail of the active list—hence
they were accessed some time ago—and it is unlikely that they will
be accessed in the near future. On the other hand, the function does
not evict a page from the active list if it is owned by some User
Mode process and has been recently used.
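The classification made in the second cycle (step 8 above) can be written out as a predicate in C. The struct and function names are invented; the decision logic follows the conditions listed in the step.

```c
#include <assert.h>

/* Invented model of the attributes consulted in the second cycle. */
struct apage {
    int mapcount;    /* page->_mapcount: >= 0 means mapped by a process */
    int anonymous;   /* anonymous page? */
    int referenced;  /* would page_referenced( ) return a positive value? */
};

/* Returns 1 if the page goes back to l_active, 0 if it drops to
 * l_inactive, per the conditions of step 8. */
int keep_active(const struct apage *p, int swap_tendency,
                int swap_area_active)
{
    if (p->mapcount >= 0) {            /* page of a User Mode address space */
        if (swap_tendency < 100)
            return 1;                  /* mapped pages are spared */
        if (p->anonymous && !swap_area_active)
            return 1;                  /* nowhere to swap it out */
        if (p->referenced)
            return 1;                  /* recently accessed */
    }
    return 0;                          /* candidate for l_inactive */
}
```

Note that, as observed in the preceding paragraph, the referenced check only matters for pages mapped in some User Mode address space; unmapped page-cache pages always drop to l_inactive in this model.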
Low on memory reclaiming is activated when a memory
allocation fails. As shown in Figure 17-3, the kernel
invokes free_more_memory( ) while
allocating a VFS buffer or a buffer head, and it invokes try_to_free_pages( ) while allocating one or
more page frames from the buddy system.
The free_more_memory( )
function performs the following actions:
Invokes wakeup_bdflush(
) to wake a pdflush kernel thread and trigger write operations for
1024 dirty pages in the page cache (see the section "The pdflush Kernel
Threads" in Chapter
15). Writing dirty pages to disk may eventually make
freeable the page frames containing buffers, buffers heads, and
other VFS data structures.
Invokes the service routine of the sched_yield( ) system call to give the
pdflush kernel thread a chance to run.
Starts a loop over all memory nodes in the system (see the
section "Non-Uniform
Memory Access (NUMA)" in Chapter 8). For each node,
invokes the try_to_free_pages(
) function passing to it a list of the "low" memory
zones (in the 80 × 86 architecture, ZONE_DMA and ZONE_NORMAL; see the section "Memory Zones" in
Chapter 8).
The try_to_free_pages(
) function receives three parameters:
zones
A list of memory zones in which pages should be reclaimed (see the section "Memory Zones" in Chapter 8)

gfp_mask
The set of allocation flags that were used by the failed memory allocation (see the section "The Zoned Page Frame Allocator" in Chapter 8)

order
Not used
The goal of the function is to free at least 32 page frames by
repeatedly invoking the shrink_caches(
) and shrink_slab( )
functions, each time with a higher priority than the previous
invocation. The ancillary functions get the priority level—as well
as other parameters of the ongoing scan operation—in a descriptor of
type scan_control (see Table 17-2 earlier in
this chapter). The lowest, initial priority level is 12, while the
highest, final priority level is 0. If try_to_free_pages( ) does not succeed in
reclaiming at least 32 page frames in one of the 13 repeated
invocations of shrink_caches( )
and shrink_slab( ), the PFRA is
in serious trouble, and it has just one last resort: killing a
process to free all its page frames. This operation is performed by
the out_of_memory( ) function
(see the section "The
Out of Memory Killer" later in this chapter).
The function performs the following main steps:
Allocates and initializes a scan_control descriptor. In
particular, stores the gfp_mask allocation mask in the
gfp_mask field.
For each zone in the zones lists, it sets the temp_priority field of the zone
descriptor to the initial priority (12). Moreover, it computes
the total number of pages contained in the LRU lists of the
zones.
Performs a loop of at most 13 iterations, from priority 12 down to 0; in each iteration performs the following substeps:
Updates some field of the scan_control descriptor. In
particular, it stores in the nr_mapped field the total number
of pages owned by User Mode processes, and in the priority field the current
priority of this iteration. Also, it sets to zero the
nr_scanned and nr_reclaimed fields.
Invokes shrink_caches(
) passing as arguments the zones list and the address of the
scan_control descriptor.
This function scans the inactive pages of the zones (see
below).
Invokes shrink_slab(
) to reclaim pages from the shrinkable kernel
caches (see the section "Reclaiming Pages of
Shrinkable Disk Caches" later in this
chapter).
If the current->reclaim_state field is
not NULL, it adds to the
nr_reclaimed field of the
scan_control descriptor
the number of pages reclaimed from the slab allocator
caches; this number is stored in a small data structure
pointed to by the reclaim_state field of the process descriptor. The _ _alloc_pages( ) function sets up
the current->reclaim_state field
before invoking the try_to_free_pages( ) function, and
clears the field right after its termination. (Oddly, the
free_more_memory( )
function does not set this field.)
If the target has been reached (the nr_reclaimed field of the scan_control descriptor is greater
than or equal to 32), it breaks the loop and jumps to step
4.
The target has not yet been reached. If at least 49
pages have been scanned so far, the function invokes
wakeup_bdflush( ) to
activate a pdflush kernel thread and write some dirty pages in
the page cache to disk (see the section "Looking for Dirty
Pages To Be Flushed" in Chapter 15).
If the function has already performed four iterations
without reaching the target, it invokes blk_congestion_wait( ) to suspend
the current process until any WRITE request queue becomes
uncongested or until a 100 ms time-out elapses (see the
section "Request
Descriptors" in Chapter 14).
Sets the prev_priority
field of each zone descriptor to the priority level used in the
last invocation of shrink_caches(
); it is stored in the temp_priority field of the zone
descriptor.
Returns 1 if the reclaiming was successful, 0 otherwise.
The shrink_caches(
) function is invoked by try_to_free_pages( ). It acts on two
parameters: the zones list of
memory zones, and the address sc
of a scan_control
descriptor.
The purpose of this function is simply to invoke the shrink_zone( ) function on each zone in
the zones list. However, before
invoking shrink_zone( ) on a
given zone, shrink_caches( )
updates the temp_priority field
of the zone's descriptor by using the value stored in the sc->priority field; this is the current
priority level of the scanning operation. Moreover, if the priority
value of the previous invocation of the PFRA is higher than the
current priority value—that is, page frame reclaiming in this zone
is now harder to do—shrink_caches(
) copies the current priority level into the prev_priority field of the zone
descriptor. Finally, shrink_caches(
) does not invoke shrink_zone(
) on a given zone if the all_unreclaimable flag in the zone
descriptor is set and the current priority level is less than
12—that is, shrink_caches( ) is
not being invoked in the very first iteration of try_to_free_pages( ). The PFRA sets the
all_unreclaimable flag when it
decides that a zone is so full of unreclaimable pages that scanning
the zone's pages is just a waste of time.
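The per-zone bookkeeping just described can be condensed as follows. Here struct zone_sketch is an illustrative stand-in for the few fields of struct zone involved, and the shrunk flag merely records whether shrink_zone( ) would have been invoked.

```c
#include <assert.h>

#define DEF_PRIORITY 12

struct zone_sketch {            /* minimal stand-in for struct zone */
    int temp_priority;
    int prev_priority;
    int all_unreclaimable;
    int shrunk;                 /* records whether shrink_zone() ran */
};

/* Mimics the per-zone logic of shrink_caches(): priority bookkeeping
 * first, then the decision whether to scan the zone at all. */
void shrink_caches_sketch(struct zone_sketch *zone, int priority)
{
    zone->temp_priority = priority;
    if (zone->prev_priority > priority)   /* reclaiming got harder */
        zone->prev_priority = priority;
    if (zone->all_unreclaimable && priority != DEF_PRIORITY)
        return;                           /* skip the hopeless zone */
    zone->shrunk = 1;                     /* stands in for shrink_zone() */
}
```

Note how an all_unreclaimable zone is still scanned once per try_to_free_pages( ) run, in the very first (priority 12) iteration.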
The shrink_zone( )
function acts on two parameters: zone, a pointer to a struct zone descriptor, and sc, a pointer to a scan_control descriptor. The goal of this
function is to reclaim 32 pages from the zone's inactive list; the
function tries to reach this goal by invoking repeatedly an
auxiliary function called shrink_cache(
), each time on a larger portion of the zone's inactive
list. Moreover, shrink_zone( )
replenishes the zone's inactive list by repeatedly invoking the
refill_inactive_zone( ) function
described in the earlier section "The Least Recently Used (LRU)
Lists."
The nr_scan_active and
nr_scan_inactive fields of the
zone descriptor play a special role here. To be efficient, the
function works on batches of 32 pages. Thus, if the function is
running at a low priority level (high value of sc->priority) and one of the LRU lists
does not contain enough pages, the function skips the scanning on
that list. However, the number of active or inactive pages thus
skipped is recorded in nr_scan_active or nr_scan_inactive, so that the skipped
pages will be considered in the next invocation of the
function.
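The carry-over mechanism can be modeled in a few lines. The increment below is nr_active >> priority; the real function adds one extra page per call, so treat this as a sketch of the idea rather than the exact arithmetic.

```c
#include <assert.h>

/* Mimics the nr_scan_active bookkeeping of shrink_zone(): the pending
 * scan count grows by nr_active >> priority on each call, but scanning
 * is only triggered once at least a full batch of 32 pages has
 * accrued; otherwise the remainder is carried to the next call. */
struct zone_counts {
    unsigned long nr_active;        /* pages in the active list */
    unsigned long nr_scan_active;   /* pages pending to be scanned */
};

/* Returns the batch size handed to the scanner (0 = scan skipped). */
unsigned long account_active_scan(struct zone_counts *z, int priority)
{
    unsigned long nr;

    z->nr_scan_active += z->nr_active >> priority;
    if (z->nr_scan_active < 32)
        return 0;                   /* too few: carry count forward */
    nr = z->nr_scan_active;
    z->nr_scan_active = 0;
    return nr;
}
```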
Specifically, the shrink_zone(
) function performs the following steps:
Increases the zone->nr_scan_active by a fraction
of the total number of elements in the active list (zone->nr_active). The actual
increment is determined by the current priority level and ranges
from zone->nr_active/2^12
to zone->nr_active/2^0
(i.e., the whole number of active pages in the zone).
Increases the zone->nr_scan_inactive by a
fraction of the total number of elements in the inactive list
(zone->nr_inactive). The
actual increment is determined by the current priority level and
ranges from zone->nr_inactive/2^12
to zone->nr_inactive.
If the zone->nr_scan_active field is
greater than or equal to 32, the function copies its value in
the nr_active local variable
and sets the field to zero; otherwise, it sets nr_active to zero.
If the zone->nr_scan_inactive field is
greater than or equal to 32, the function copies its value in
the nr_inactive local
variable and sets the field to zero; otherwise, it sets nr_inactive to zero.
Sets the sc->nr_to_reclaim field of the
scan_control descriptor to
32.
If both nr_active and
nr_inactive are 0, there is
nothing to be done: the function terminates. This is an unlikely
situation where User Mode processes have no page frames
allocated to them.
If nr_active is
positive, it replenishes the zone's inactive list:
sc->nr_to_scan = min(nr_active, 32); nr_active -= sc->nr_to_scan; refill_inactive_zone(zone, sc);
If nr_inactive is
positive, it tries to reclaim at most 32 pages from the inactive
list:
sc->nr_to_scan = min(nr_inactive, 32); nr_inactive -= sc->nr_to_scan; shrink_cache(zone, sc);
If shrink_zone( )
succeeds in reclaiming 32 pages (sc->nr_to_reclaim is now zero or
negative), it terminates. Otherwise, it jumps back to step
6.
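The loop of steps 5 through 9 can be sketched as follows. This is a deliberately optimistic model: the chunk handed to shrink_cache( ) is assumed to be fully reclaimed, which the real function of course does not guarantee.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Simplified inner loop of shrink_zone(): work in chunks of at most
 * 32 pages, refilling the inactive list from the active one and
 * reclaiming from the inactive list, until either the 32-page target
 * is met or both counts are exhausted.  Returns the pages reclaimed. */
long shrink_zone_sketch(unsigned long nr_active, unsigned long nr_inactive)
{
    long nr_to_reclaim = 32;            /* sc->nr_to_reclaim */

    while (nr_active || nr_inactive) {
        if (nr_active) {
            unsigned long chunk = min_ul(nr_active, 32);
            nr_active -= chunk;         /* refill_inactive_zone() */
        }
        if (nr_inactive) {
            unsigned long chunk = min_ul(nr_inactive, 32);
            nr_inactive -= chunk;       /* shrink_cache(), assumed to */
            nr_to_reclaim -= chunk;     /* reclaim every scanned page */
        }
        if (nr_to_reclaim <= 0)
            break;                      /* 32 pages reclaimed */
    }
    return 32 - nr_to_reclaim;
}
```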
The shrink_cache( )
function is yet another auxiliary function whose main purpose is to
extract from the zone's inactive list a group of pages, put them in
a temporary list, and invoke the shrink_list( ) function to effectively
perform page frame reclaiming on every page in that list. The
shrink_cache( ) function acts on
the same parameters as shrink_zone(
), namely zone and
sc, and performs the following
main steps:
Invokes lru_add_drain(
) to move into the active and inactive lists any page
still contained in the pagevec data structures (see the
section "The Least
Recently Used (LRU) Lists" earlier in this
chapter).
Gets the zone->lru_lock spin lock.
Considers at most 32 pages in the inactive list; for each
page, the function increases its usage counter, checks whether
the page is not being freed to the buddy system (see the
discussion at step 3 of refill_inactive_zone( )), and moves
the page from the zone's inactive list to a local list.
Decreases the counter zone->nr_inactive by the number of
pages removed from the inactive list.
Increases the counter zone->pages_scanned by the number
of pages effectively examined in the inactive list.
Releases the zone->lru_lock spin lock.
Invokes the shrink_list(
) function passing to it the (local list of) pages
collected in step 3 above. This function is discussed below (as
you were no doubt expecting).
Decreases the sc->nr_to_reclaim field by the
number of pages actually reclaimed by shrink_list( ).
Acquires the zone->lru_lock spin lock again.
Puts back in the inactive or active list all pages of the
local list that shrink_list(
) did not succeed in freeing. Notice that shrink_list( ) might mark a page as
active by setting its PG_active flag. This operation is
performed in a batch of pages using a pagevec data structure (see the
section "The Least
Recently Used (LRU) Lists" earlier in this
chapter).
If the function scanned at least sc->nr_to_scan pages, and if it
didn't succeed in reclaiming the target number of pages (i.e.,
sc->nr_to_reclaim is still
positive), it jumps back to step 3.
Releases the zone->lru_lock spin lock and
terminates.
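Steps 3 through 5 amount to the following counter updates. Here struct zone_lru is an illustrative stand-in; in the kernel the moves happen on real page lists while zone->lru_lock is held.

```c
#include <assert.h>

/* Mimics steps 3-5 of shrink_cache(): detach at most 32 pages from
 * the inactive list into a local batch, updating the zone counters. */
struct zone_lru {
    unsigned long nr_inactive;     /* pages left in the inactive list */
    unsigned long pages_scanned;   /* running scan counter */
};

/* Returns the number of pages moved to the local list. */
unsigned long isolate_batch(struct zone_lru *z)
{
    unsigned long taken = z->nr_inactive < 32 ? z->nr_inactive : 32;

    z->nr_inactive   -= taken;     /* removed from the inactive list */
    z->pages_scanned += taken;     /* pages examined this pass */
    return taken;
}
```

Batching 32 pages at a time keeps the zone->lru_lock hold times short, which is the point of collecting pages into a local list before calling shrink_list( ).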
We have now reached the heart of page frame
reclaiming. While the purpose of the functions illustrated so far,
from try_to_free_pages( ) to
shrink_cache( ), was to select
the proper set of candidate pages for reclaiming, the shrink_list( ) function effectively tries
to reclaim the pages passed as a parameter in the page_list list. The second parameter,
namely sc, is the usual pointer
to a scan_control descriptor.
When shrink_list( ) returns,
page_list contains the pages that
couldn't be freed.
The function performs the following actions:
If the need_resched
field of the current process is set, it invokes schedule( ).
Starts a cycle on every page descriptor included in the
page_list list. For each list
item, it removes the page descriptor from the list and tries to
reclaim the page frame; if for some reason the page frame could
not be freed, it inserts the page descriptor in a local
list.
Now the page_list list
is empty: the function moves back the page descriptors from the
local list to the page_list
list.
Increases the sc->nr_reclaimed field by the
number of page frames reclaimed in step 2, and returns that
number.
Of course, what is really interesting in shrink_list( ) is the code that tries to
reclaim a page frame. The flow diagram of this code is shown in
Figure 17-5.
There are only three possible outcomes for each page frame
handled by shrink_list( ):
The page is released to the zone's buddy system by
invoking the free_cold_page(
) function (see the section "The Per-CPU Page Frame
Cache" in Chapter
8); hence, the page is effectively reclaimed.
The page is not reclaimed, thus it will be reinserted in
the page_list list; however,
shrink_list( ) assumes that
it will be possible to reclaim the page in the near future.
Thus, the function leaves the PG_active flag in the page descriptor
cleared, so that the page will be put back in the inactive list
of the memory zone (see step 9 in the description of shrink_cache( ) above). This event
corresponds to the small boxes labeled as "INACTIVE" in Figure 17-5.
The page is not reclaimed, thus it will be reinserted in
the page_list list; however,
either the page is in active use, or shrink_list( ) assumes that it will be
impossible to reclaim the page in the foreseeable future. Thus,
the function sets the PG_active flag in the page descriptor,
so that the page will be put in the active list of the memory
zone. This event corresponds to the small boxes labeled as
"ACTIVE" in Figure
17-5.
The shrink_list( ) function
never tries to reclaim a page that is locked (PG_locked flag set) or under writeback
(PG_writeback flag set). In order
to test whether the page was recently referenced, shrink_list( ) invokes page_referenced( ), which was described in
the section "The Least
Recently Used (LRU) Lists" earlier in this chapter.
To reclaim an anonymous page, the page must be added to the swap cache, and a new slot in a swap area must be reserved for it; see the section "Swapping" later in this chapter for details.
If the page is in the User Mode address space of some process
(the _mapcount field in the page
descriptor is greater than or equal to zero), shrink_list( ) invokes the try_to_unmap( ) function to locate all
User Mode Page Table entries that refer to the page frame (see the
section "Reverse
Mapping" earlier in this chapter). Of course, reclaiming may
proceed only if this function returns SWAP_SUCCESS.
If the page is dirty, it cannot be reclaimed unless it is
written to disk. To do this, shrink_list(
) relies on the pageout(
) function, which is described next. The reclaiming of the
page frame may proceed only if either pageout( ) does not have to issue a write
operation, or if the write operation finishes soon.
If the page contains VFS buffers, shrink_list( ) invokes try_to_release_page( ) to release the
associated buffer heads (see the section "Releasing Block Device Buffer
Pages" in Chapter
15).
Finally, if everything went smoothly, shrink_list( ) checks the reference
counter of the page: if it is equal to two, the page has just two
owners: the page cache (or the swap cache, in case of anonymous
pages), and the PFRA itself (the reference counter was increased in
step 3 of shrink_cache( ); see
earlier). In this case, the page can be reclaimed, provided it is
still not dirty. To do this, first the page is removed from the page
cache or the swap cache, according to the value of the PG_swapcache flag of the page descriptor;
then, the free_cold_page( )
function is executed.
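The three outcomes can be condensed into a single decision function. This is a rough model of the checks listed above: the flags are booleans standing in for PG_locked, PG_writeback, page_referenced( ), try_to_unmap( ), and pageout( ) results, and several subtleties of the real flow diagram are deliberately collapsed.

```c
#include <assert.h>

enum outcome { RECLAIMED, KEEP_INACTIVE, KEEP_ACTIVE };

/* Condensed per-page decision logic of shrink_list(). */
struct page_state {
    int locked, writeback, referenced;
    int mapped, unmap_ok;          /* in a User Mode address space? */
    int dirty, pageout_ok;
    int count;                     /* page reference counter */
};

enum outcome shrink_list_sketch(const struct page_state *p)
{
    if (p->locked || p->writeback)
        return KEEP_INACTIVE;      /* busy: try again later */
    if (p->referenced)
        return KEEP_ACTIVE;        /* recently used: promote it */
    if (p->mapped && !p->unmap_ok)
        return KEEP_ACTIVE;        /* could not unmap all PTEs */
    if (p->dirty && !p->pageout_ok)
        return KEEP_INACTIVE;      /* write-back still pending */
    if (p->count != 2)
        return KEEP_INACTIVE;      /* someone else still holds it */
    return RECLAIMED;              /* remove from cache, free_cold_page() */
}
```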
The pageout( )
function is invoked by shrink_list(
) when a dirty page must be written to disk. Essentially,
the function performs the following operations:
Checks that the page is included in the page cache or in
the swap cache (see the section "The Swap Cache"
later in this chapter). Moreover, checks that the page is owned
only by the page cache—or the swap cache—and the PFRA. Returns
PAGE_KEEP if a check has
failed (it does not make sense to write the page to disk if it
is not reclaimable by shrink_list(
)).
Checks that the writepage method of the address_space object is defined;
returns PAGE_ACTIVATE
otherwise.
Checks that the current process can issue write requests
to the request queue of the block device associated with the
address_space object.
Essentially, the kswapd and pdflush kernel threads may always issue the write
request; normal processes can issue the write request only if
the request queue is not congested, unless the current->backing_dev_info field
points to the backing_dev_info data structure of the
block device (see step 3 of the description of the generic_file_aio_write_nolock( )
function in the section "Writing to a File"
in Chapter
16).
Checks that the page is still dirty; if not, returns
PAGE_CLEAN.
Sets up a writeback_control descriptor and
invokes the writepage method
of the address_space object
to start a write back operation (see the section "Writing Dirty Pages to
Disk" in Chapter
16).
If the writepage method
returned an error code, the function returns PAGE_ACTIVATE.
Returns PAGE_SUCCESS.
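The order of the checks and the four return codes can be sketched as follows; the context flags are illustrative stand-ins for the real cache and request-queue state.

```c
#include <assert.h>

/* The four return codes of pageout(), produced by the checks above
 * in the order in which they are described. */
enum pageout_result { PAGE_KEEP, PAGE_ACTIVATE, PAGE_CLEAN, PAGE_SUCCESS };

struct pageout_ctx {
    int in_cache_and_unshared;  /* owned only by page/swap cache + PFRA */
    int has_writepage;          /* address_space writepage method defined */
    int may_write;              /* caller allowed to use the request queue */
    int dirty;                  /* page still dirty */
    int write_failed;           /* writepage returned an error */
};

enum pageout_result pageout_sketch(const struct pageout_ctx *c)
{
    if (!c->in_cache_and_unshared)
        return PAGE_KEEP;       /* not reclaimable: don't bother writing */
    if (!c->has_writepage)
        return PAGE_ACTIVATE;
    if (!c->may_write)
        return PAGE_KEEP;       /* congested queue, normal process */
    if (!c->dirty)
        return PAGE_CLEAN;      /* someone cleaned it in the meantime */
    if (c->write_failed)
        return PAGE_ACTIVATE;
    return PAGE_SUCCESS;
}
```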
We know from the previous chapters that the kernel uses other disk caches beside the page cache, for instance the dentry cache and the inode cache (see the section "The dentry Cache" in Chapter 12). When the PFRA tries to reclaim page frames, it should also check whether some of these disk caches can be shrunk.
Every disk cache that is considered by the PFRA must have a shrinker function registered at initialization time. The shrinker function expects two parameters: the target number of page frames to be reclaimed, and a set of GFP allocation flags; the function does what is required to reclaim the pages from the disk cache, then it returns the number of reclaimable pages remaining in the cache.
The set_shrinker( ) function
registers a shrinker function with the PFRA. This function allocates a
descriptor of type shrinker, stores
the address of the shrinker function in the descriptor, and then
inserts the descriptor in a global list rooted at the shrinker_list global variable. The set_shrinker( ) function also initializes
the seeks field of the shrinker descriptor: informally, it is a
parameter that indicates how much it costs to re-create one item of
the cache once it is removed.
In Linux 2.6.11 there are few disk caches registered with the PFRA: besides the dentry cache and the inode cache, only the disk quota layer, the filesystem meta information block cache (mainly used for filesystems' extended attributes), and the XFS journaling filesystem register shrinker functions.
The PFRA's function that reclaims pages from the shrinkable disk
caches is called shrink_slab( )
(the name is a bit misleading, because the function has little to do
with the slab allocator caches). This function is invoked by try_to_free_pages( ), as explained in the
earlier section "Low On
Memory Reclaiming," and by balance_pgdat( ), which is described in the
later section "Periodic
Reclaiming."
The shrink_slab( ) function
tries to balance the cost of reclaiming from the shrinkable disk cache
with the cost of reclaiming from the LRU lists (performed by shrink_list( )). Essentially, the function
walks the list in the shrinker
descriptors to invoke the shrinker functions and get the total number
of reclaimable pages in the disk caches. Then, the function scans
again the list of the shrinker
descriptors; for each shrinkable disk cache, the function heuristically
computes a target number of page frames to be reclaimed—based on the
number of reclaimable pages in the disk caches, on the relative cost
of re-creating a page in the disk cache, and on the number of pages in
the LRU lists—and invokes the shrinker function to try to reclaim
batches of at least 128 pages.
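The balancing heuristic can be approximated as follows. The formula is an illustrative reconstruction in the spirit of the 2.6 code, not the exact kernel arithmetic; SHRINK_BATCH is the real 128-page batch constant.

```c
#include <assert.h>

#define SHRINK_BATCH 128   /* pages per shrinker invocation */

/* Rough sketch of the shrink_slab() heuristic: scale the number of
 * LRU pages just scanned by the cache's reclaimable size and by
 * 1/seeks (re-creation cost), relative to the LRU size, to obtain a
 * per-cache target; work is then handed out in 128-page batches. */
unsigned long slab_target(unsigned long lru_scanned,
                          unsigned long cache_pages,  /* reclaimable */
                          int seeks,
                          unsigned long lru_pages)
{
    unsigned long target = 4 * lru_scanned * cache_pages
                           / (seeks * (lru_pages + 1));
    return target - target % SHRINK_BATCH;  /* whole batches only */
}
```

A cheap-to-rebuild cache (low seeks) thus gets a proportionally larger reclaim target than an expensive one of the same size.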
For lack of space, we'll limit ourselves to describe briefly the shrinker functions of the dentry cache and of the inode cache.
The shrink_dcache_memory(
) function is the shrinker function for the dentry cache;
it searches the cache for unused dentry objects—that is, objects not
referenced by any process, see the section "dentry Objects" in
Chapter 12—and releases
them.
Because the dentry cache objects are allocated through the
slab allocator, the shrink_dcache_memory(
) function may lead some slabs to become free, causing
some page frames to be consequently reclaimed by cache_reap( ) (see the section "Periodic Reclaiming"
later in this chapter). Moreover, the dentry cache acts as a
controller of the inode cache. Therefore, when a dentry object is
released, the pages storing the corresponding inode may become
unused, and thus eventually released.
The shrink_dcache_memory( )
function receives as its parameters the number of page frames to
reclaim and a GFP mask. It starts by checking whether the _ _GFP_FS bit in the GFP mask is clear; if
so, the function returns -1,
because releasing a dentry may trigger an operation on a disk-based
filesystem. Page frame reclaiming is effectively done by invoking
prune_dcache( ). This function
scans the list of unused dentries—whose head is stored in the
dentry_unused variable—until it
reaches the requested number of freed objects or until the whole
list is scanned. On each object that wasn't recently referenced, the
function:
Removes the dentry object from the dentry hash table, from the list of dentry objects in its parent directory, and from the list of dentry objects of the owner inode.
Decreases the usage counter of the dentry's inode by
invoking the d_iput dentry
method, if defined, or the iput(
) function.
Invokes the d_release
method of the dentry object, if defined.
Invokes the call_rcu( )
function to register a callback function that will remove the
dentry object (see the section "Read-Copy Update
(RCU)" in Chapter
5). The callback function, in turn, will invoke kmem_cache_free( ) to release the
object to the slab allocator (see the section "Freeing a Slab
Object" in Chapter
8).
Decreases the usage counter of the parent directory.
Finally, shrink_dcache_memory(
) returns a value based on the number of unused dentries
still contained in the dentry cache. More precisely, the returned
value is the number of unused dentries multiplied by the content of
the sysctl_vfs_cache_pressure global variable and divided by 100.
By default, this variable is equal to 100, thus the returned value
is essentially the number of unused dentries. However, the system
administrator may modify the variable by writing into the /proc/sys/vm/vfs_cache_pressure file or by
issuing a suitable sysctl( )
system call. Setting this variable to a value smaller
than 100 causes shrink_slab( ) to
reclaim fewer pages from the dentry cache (and the inode cache; see
the next section) with respect to the pages reclaimed from the LRU
lists; conversely, setting the variable to a value greater than 100
causes shrink_slab( ) to reclaim
more pages from the dentry and inode caches with respect to the
pages reclaimed from the LRU lists.
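Under this description the value reported back to shrink_slab( ) is a simple scaled count, as in the 2.6.11 source, where the unused-dentry count is scaled by sysctl_vfs_cache_pressure/100:

```c
#include <assert.h>

/* Value returned by shrink_dcache_memory() to shrink_slab(): the
 * unused-dentry count scaled by the vfs_cache_pressure tunable
 * (default 100, i.e., no scaling; below 100 the dentry cache is
 * under-reported and therefore shrunk less aggressively). */
unsigned long dcache_shrinker_report(unsigned long unused_dentries,
                                     unsigned long vfs_cache_pressure)
{
    return unused_dentries * vfs_cache_pressure / 100;
}
```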
The shrink_icache_memory(
) function is invoked to remove unused inode objects from
the inode cache; here, "unused" means that the inode no longer has a
controlling dentry object. The function is similar to the shrink_dcache_memory( ) described
previously. It checks the _
_GFP_FS bit in the gfp_mask parameter, then it invokes the
prune_icache( ) function, and
finally it returns a value based both on the number of unused inodes
still included in the inode cache and the value of the sysctl_vfs_cache_pressure variable, as
previously.
The prune_icache( )
function, in turn, scans the inode_unused list (see the section "Inode Objects" in
Chapter 12); to free an
inode, the function releases any private buffer associated with the
inode, it invalidates the clean page frames in the page cache that
refer to the inode and are no longer in use, and then it makes use
of the clear_inode( ) and
destroy_inode( ) functions to
destroy the inode object.
The PFRA performs periodic reclaiming by using two different mechanisms: the
kswapd kernel threads, which invoke shrink_zone( ) and shrink_slab( ) to reclaim pages from the LRU
lists, and the cache_reap( ) function,
which is invoked periodically to reclaim unused slabs from the slab
allocator.
The kswapd kernel threads are
another kernel mechanism that activates page frame reclaiming. Why
is it necessary? Is it not sufficient to invoke try_to_free_pages( ) when free memory
becomes really scarce and another memory allocation request is
issued?
Unfortunately, this is not the case. Some memory allocation requests are performed by interrupt and exception handlers, which cannot block the current process waiting for a page frame to be freed; moreover, some memory allocation requests are done by kernel control paths that have already acquired exclusive access to critical resources and that, therefore, cannot activate I/O data transfers. In the infrequent case in which all memory allocation requests are done by such sorts of kernel control paths, the kernel is never able to free memory.
The kswapd kernel threads also have a beneficial effect on system performance by keeping memory free in what would otherwise be idle time for the machine; processes can thus get their pages much faster.
There is a different kswapd kernel thread
for each memory node (see the section "Non-Uniform Memory Access
(NUMA)" in Chapter
8). Each such thread is usually sleeping in the wait queue
headed at the kswapd_wait field
of the node descriptor. However, if _
_alloc_pages( ) discovers that all memory zones suitable
for a memory allocation have a number of free page frames below a
"warning" threshold—essentially, a value based on the pages_low and protection fields of the memory zone
descriptor—then the function wakes up the
kswapd kernel threads of the corresponding
memory nodes (see the section "The Zone Allocator" in
Chapter 8). Essentially,
the kernel starts to reclaim some page frames in order to avoid much
more dramatic "low on memory" conditions.
As explained in the section "The Pool of Reserved Page
Frames" in Chapter
8, every zone descriptor also includes a pages_min field—a threshold that specifies
the minimum number of free page frames that should always be
preserved—and a pages_high
field—a threshold that specifies the "safe" number of free page
frames above which page frame reclaiming should be stopped.
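The role of the two thresholds (together with pages_min) can be captured by two predicates. These are illustrative helpers, not kernel functions; the real test also factors in the protection array and the allocation order via zone_watermark_ok( ).

```c
#include <assert.h>

/* The three per-zone watermarks described above: wake kswapd when
 * free memory drops near pages_low, keep reclaiming until the zone is
 * back above pages_high; pages_min marks the reserve that only
 * critical allocations may consume. */
struct zone_wm { unsigned long pages_min, pages_low, pages_high; };

int should_wake_kswapd(const struct zone_wm *z, unsigned long free)
{
    return free < z->pages_low;
}

int kswapd_may_rest(const struct zone_wm *z, unsigned long free)
{
    return free >= z->pages_high;
}
```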
The kswapd kernel thread executes the
kswapd( ) function. It
initializes the kernel thread by binding the kernel thread to the
CPUs that may access the memory node, by storing in the current->reclaim_state field of the
process descriptor the address of a reclaim_state descriptor (see step 3d in
the description of try_to_free_pages(
) earlier in this chapter), and by setting the PF_MEMALLOC and PF_KSWAPD flags in the current->flags field—these flags
indicate that the process is reclaiming memory and that it is
allowed to use all the free memory available when doing its job.
Every time the kswapd kernel thread is
awakened, the kswapd( ) function
performs essentially the following steps:
Invokes finish_wait( )
to remove the kernel thread from the node's kswapd_wait wait queue (see the
section "How
Processes Are Organized" in Chapter 3).
Invokes balance_pgdat(
) to perform the memory reclaiming on the
kswapd's memory node (see below).
Invokes prepare_to_wait(
) to set the process in the TASK_INTERRUPTIBLE state and to put it
to sleep in the node's kswapd_wait wait queue.
Invokes schedule( ) to
yield the CPU to some other runnable process.
The balance_pgdat( )
function performs, in turn, the following basic steps:
Sets up a scan_control
descriptor (see Table 17-2 earlier
in this chapter).
Sets the temp_priority
field of each zone descriptor in the memory node to 12 (lowest
priority).
Performs a loop of at most 13 iterations, from priority 12 down to 0; in each iteration performs the following substeps:
Scans the memory zones to find the highest zone (from
ZONE_DMA to ZONE_HIGHMEM) having an
insufficient number of free page frames. Each test is done
by executing the zone_watermark_ok(
) function described in the section "The Zone
Allocator" in Chapter 8. If all zones
have a large number of free page frames, it jumps to step
4.
Scans again the memory zones proceeding from ZONE_DMA to the zone found in step
3a. For each zone, it updates, if necessary, the prev_priority field of the zone
descriptor with the current priority level, and invokes
successively shrink_zone(
) to reclaim pages from the zone (see the earlier
section "Low On
Memory Reclaiming"). Next, it invokes shrink_slab( ) to reclaim pages
from the shrinkable disk caches (see the earlier section
"Reclaiming
Pages of Shrinkable Disk Caches").
If at least 32 pages have been reclaimed, it breaks the loop and jumps to step 4.
Updates the prev_priority field of each zone
descriptor with the value stored in the corresponding temp_priority field.
If some "low on memory" zone still exists, it invokes
schedule( ) if the need_resched field of the process is
set; when in execution again, it jumps back to step 1.
Returns the number of pages reclaimed.
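The priority loop just described can be sketched in user space. Here reclaim_at_priority( ) is a hypothetical stand-in for the combined shrink_zone( )/shrink_slab( ) work; only the control flow (13 iterations, priority 12 down to 0, early exit after 32 reclaimed pages) mirrors the text.

```c
#include <assert.h>

/* User-space sketch of the balance_pgdat() priority loop: at most 13
 * iterations, from priority 12 (lightest scan) down to 0 (hardest),
 * stopping as soon as at least 32 pages have been reclaimed. */
typedef unsigned long (*reclaim_fn)(int priority);

static unsigned long balance_pgdat_sketch(reclaim_fn reclaim_at_priority)
{
    unsigned long total = 0;
    for (int priority = 12; priority >= 0; priority--) {
        total += reclaim_at_priority(priority);
        if (total >= 32)        /* enough progress: break the loop */
            break;
    }
    return total;               /* number of pages reclaimed */
}

/* Example reclaim callback: frees more pages as the scan gets harder. */
static unsigned long demo_reclaim(int priority)
{
    return (unsigned long)(12 - priority) * 4;
}
```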
The PFRA must also reclaim the pages owned by the slab
allocator caches (see the section "Memory Area Management "
in Chapter 8). To do this,
it relies on the cache_reap( )
function, which is periodically scheduled—approximately once every
two seconds—in the predefined events work queue
(see the section "Work
Queues" in Chapter
4). The address of the cache_reap(
) function is stored in the func field of the reap_work per-CPU variable of type
work_struct.
The cache_reap( ) function
essentially performs the following steps:
Tries to acquire the cache_chain_sem semaphore, which
protects the list of slab cache descriptors; if the semaphore is
already taken, it invokes schedule_delayed_work( ) to schedule
the next invocation of the function, and terminates.
Otherwise, scans the kmem_cache_t descriptors collected in
the cache_chain list. For
each cache descriptor found, the function performs the following
steps:
If the SLAB_NO_REAP
flag in the cache descriptor is set, page frame reclaiming
has been disabled, hence it continues with the next cache in
the list.
Drains the slab local cache (see the section "Local Caches of Free Slab Objects" in Chapter 8); this operation could cause new slabs to become free.
Each cache has a "reap time" stored in the next_reap field of the kmem_list3 structure inside the
cache descriptor (see the section "Cache
Descriptor" in Chapter 8); if jiffies is still smaller than
next_reap, it continues
with the next cache in the list.
Sets the next "reap time" in the next_reap field to a value four
seconds from the current time.
In multiprocessor systems, the function drains the slab shared cache (see the section "Local Caches of Free Slab Objects" in Chapter 8); this operation could cause new slabs to become free.
If a new slab has been recently added to the
cache—that is, if the free_touched flag of the kmem_list3 structure inside the
cache descriptor is set—it skips this cache and continues
with the next cache in the list.
Computes according to a heuristic formula the number of slabs to be freed. Basically, this number depends on the upper limit of free objects in the cache and on the number of objects packed into a single slab.
Repeatedly invokes slab_destroy( ) on the slabs
included in the list of free slabs of the cache, until the
list is empty or the target number of free slabs has been
reached.
Invokes cond_resched(
) to check the TIF_NEED_RESCHED flag of the
current process and to invoke schedule( ), if the flag is
set.
Releases the cache_chain_sem semaphore.
Invokes schedule_delayed_work(
) to schedule the next invocation of the function, and
terminates.
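The per-cache "reap time" bookkeeping can be sketched as follows. The struct is a simplification of kmem_list3, and the HZ value is an assumption of this sketch; only the jiffies-versus-next_reap comparison and the four-second rescheduling follow the text.

```c
#include <assert.h>

/* Sketch of the reap-time check performed by cache_reap(): a cache is
 * skipped while jiffies has not yet reached its next_reap time; once it
 * is reaped, next_reap is pushed four seconds into the future. */
#define HZ 1000                       /* assumed tick rate */
#define REAP_INTERVAL (4 * HZ)        /* four seconds, as in the text */

struct toy_cache {
    unsigned long next_reap;          /* jiffies value of the next reap */
};

static int cache_due_for_reap(struct toy_cache *c, unsigned long jiffies)
{
    if (jiffies < c->next_reap)
        return 0;                     /* too early: skip this cache */
    c->next_reap = jiffies + REAP_INTERVAL;
    return 1;                         /* caller may now free slabs */
}
```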
Despite the PFRA effort to keep a reserve of free page frames, it is possible for the pressure on the virtual memory subsystem to become so high that all available memory becomes exhausted. This situation could quickly induce a freeze of every activity in the system: the kernel keeps trying to free memory in order to satisfy some urgent request, but it does not succeed because the swap areas are full and all disk caches have already been shrunken. As a consequence, no process can proceed with its execution, thus no process will eventually free up the page frames that it owns.
To cope with this dramatic situation, the PFRA makes use of a so-called out of memory (OOM) killer, which selects a process in the system and abruptly kills it to free its page frames. The OOM killer is like a surgeon that amputates the limb of a man to save his life: losing a limb is not a nice thing, but sometimes there is nothing better to do.
The out_of_memory( ) function
is invoked by _ _alloc_pages( )
when the free memory is very low and the PFRA has not succeeded in
reclaiming any page frames (see the section "The Zone Allocator" in
Chapter 8). The function
invokes select_bad_process( ) to
select a victim among the existing processes, then invokes oom_kill_process( ) to perform the
sacrifice.
Of course, select_bad_process(
) does not simply pick a process at random. The selected
process should satisfy several requisites:
The victim should own a large number of page frames, so that the amount of memory that can be freed is significant. (As a countermeasure against the "fork-bomb" processes, the function considers the amount of memory eaten by all children owned by the parent, too.)
Killing the victim should lose a small amount of work—it is not a good idea to kill a batch process that has been working for hours or days.
The victim should be a low static priority process—the users tend to assign lower priorities to less important processes.
The victim should not be a process with root privileges—they usually perform important tasks.
The victim should not directly access hardware devices (such as the X Window server), because the hardware could be left in an unpredictable state.
The victim cannot be swapper (process 0), init (process 1), or any other kernel thread.
The select_bad_process( )
function scans every process in the system, uses an empirical formula
to compute from the above rules a value that denotes how good
selecting that process is, and returns the process descriptor address
of the "best" candidate for eviction. Then, the out_of_memory( ) function invokes oom_kill_process( ) to send a deadly
signal—usually SIGKILL; see Chapter 11—either to a child of
that process or, if this is not possible, to the process itself. The
oom_kill_process( ) function also
kills all clones that share the same memory descriptor with the
selected victim.
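A toy scoring function in the spirit of select_bad_process( ) can encode the requisites above. This is not the kernel's empirical formula—only an illustrative sketch in which a higher score marks a better victim, and the struct fields are invented for the example.

```c
#include <assert.h>

/* Toy badness score encoding the requisites listed in the text:
 * big memory consumers score high, long-running work is protected,
 * low-priority (high nice) tasks are preferred, and root processes
 * and kernel threads are never selected. */
struct toy_task {
    unsigned long pages;        /* page frames owned (children included) */
    unsigned long runtime_secs; /* work that would be lost if killed */
    int nice;                   /* static priority: higher = less important */
    int is_root;                /* root processes are spared */
    int is_kernel_thread;       /* kernel threads are never killed */
};

static unsigned long oom_badness_sketch(const struct toy_task *t)
{
    if (t->is_kernel_thread || t->is_root)
        return 0;                       /* never selected */
    unsigned long score = t->pages;
    score /= (t->runtime_secs + 1);     /* long-running work is precious */
    if (t->nice > 0)
        score *= 2;                     /* low-priority tasks go first */
    return score;
}
```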
As you might have realized while reading this chapter, the Linux VM subsystem—and particularly the PFRA—is so complex a piece of code that it is quite hard to predict its behavior with an arbitrary workload. There are cases, moreover, in which the VM subsystem exhibits pathological behaviors. An example is the so-called swap thrashing phenomenon: essentially, when the system is short of free memory, the PFRA tries hard to free memory by writing pages to disk and stealing the underlying page frames from some processes; at the same time, however, these processes want to proceed with their executions, hence they try hard to access their pages. As a consequence, the kernel assigns to the processes the page frames just freed by the PFRA and reads their contents from disk. The net result is that pages are continuously written to and read back from the disk; most of the time is spent accessing the disk, hence no process makes substantial progress towards its termination.
To mitigate the likelihood of swap thrashing, a technique proposed by Jiang and Zhang in 2004 has been implemented in the kernel version 2.6.9: essentially, a so-called swap token is assigned to a single process in the system; the token exempts the process from the page frame reclaiming, so the process can make substantial progress and, hopefully, terminate even when memory is scarce.
The swap token is implemented as a swap_token_mm memory descriptor pointer.
When a process owns the swap token, swap_token_mm is set to the address of the
process's memory descriptor.
Immunity from page frame reclaiming is granted in an elegant and
simple way. As we have seen in the section "The Least Recently Used (LRU)
Lists," a page is moved from the active to the inactive list
only if it was not recently referenced. The check is done by the
page_referenced( ) function, which
honors the swap token and returns 1 (referenced) if the page belongs
to a memory region of the process that owns the swap token. Actually,
in a couple of cases the swap token is not considered: when the PFRA
is executing on behalf of the process that owns the swap token, and
when the PFRA has reached the hardest priority level in page frame
reclaiming (level 0).
The grab_swap_token( )
function determines whether the swap token should be assigned to the
current process. It is invoked on each major page fault, namely on
just two occasions:
When the filemap_nopage(
) function discovers that the required page is not in
the page cache (see the section "Demand Paging for Memory
Mapping" in Chapter
16).
When the do_swap_page( )
function has read a new page from a swap area (see the section
"Swapping in
Pages" later in this chapter).
The grab_swap_token( )
function makes some checks before assigning the token. In particular,
the token is granted if all of the following conditions hold:
At least two seconds have elapsed since the last invocation
of grab_swap_token( ).
The current token-holding process has not raised a major
page fault since the last execution of grab_swap_token( ), or has been holding
the token since at least swap_token_default_timeout ticks.
The swap token has not been recently assigned to the current process.
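The three conditions can be sketched as a single admission check. The struct and its field names are illustrative assumptions of this sketch (the kernel keeps this state in a few global variables), and HZ is assumed; only the logic of the three tests follows the text.

```c
#include <assert.h>

/* Sketch of the grab_swap_token() admission checks; all times are in
 * ticks. Field names are invented for the example. */
#define HZ 1000

struct token_state {
    unsigned long last_grab_attempt;  /* jiffies of last grab_swap_token() */
    unsigned long holder_since;       /* jiffies when the holder got it */
    int holder_faulted_since;         /* holder raised a major fault? */
    int current_recently_held;        /* current task held it recently? */
};

static int may_grab_token(const struct token_state *s, unsigned long jiffies,
                          unsigned long swap_token_default_timeout)
{
    /* at least two seconds since the last invocation */
    if (jiffies - s->last_grab_attempt < 2 * HZ)
        return 0;
    /* holder must be idle, or must have held the token long enough */
    if (s->holder_faulted_since &&
        jiffies - s->holder_since < swap_token_default_timeout)
        return 0;
    /* don't bounce the token right back to the same process */
    if (s->current_recently_held)
        return 0;
    return 1;
}
```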
The token holding time should ideally be rather long, even in
the order of minutes, because the goal is to allow a process to finish
its execution. In Linux 2.6.11 the token holding time is set by
default to a very low value, namely one tick. However, the system
administrator can tune the value of the swap_token_default_timeout variable by
writing in the /proc/sys/vm/swap_token_default_timeout
file or by issuing a proper sysctl(
) system call.
When a process is killed, the kernel checks whether that process
was holding the swap token and, if so, releases it; this is done by
the mmput( ) function (see the
section "The Memory
Descriptor" in Chapter
9).
[*] The name "swap tendency" is a bit misleading, because the pages in User Mode address spaces can be swappable, syncable, or discardable. However, the swap tendency value certainly controls the amount of swapping performed by the PFRA, because almost all swappable pages belong to the User Mode address spaces.
[*] Notice that a page that does not belong to any User
Mode process address space is moved into the inactive list;
however, since its PG_referenced flag is not cleared,
the first access to the page causes the mark_page_accessed( ) function to
move the page back into the active list (see Figure
17-4).
Swapping has been introduced to offer a backup on disk for unmapped pages. We know from the previous discussion that there are three kinds of pages that must be handled by the swapping subsystem:
Pages that belong to an anonymous memory region of a process (User Mode stack or heap)
Dirty pages that belong to a private memory mapping of a process
Pages that belong to an IPC shared memory region (see the section "IPC Shared Memory" in Chapter 19)
Like demand paging, swapping must be transparent to programs. In
other words, no special instruction related to swapping needs to be
inserted into the code. To understand how this can be done, recall from
the section "Regular
Paging" in Chapter 2
that each Page Table entry includes a Present flag. The kernel exploits this flag to
signal that a page belonging to a process address space has been swapped
out. Besides that flag, Linux also takes advantage of the remaining bits
of the Page Table entry to store into them a "swapped-out page
identifier" that encodes the location of the swapped-out page on disk.
When a Page Fault exception occurs, the corresponding exception handler can
detect that the page is not present in RAM and invoke the function that
swaps in the missing page from disk.
The main features of the swapping subsystem can be summarized as follows:
Set up "swap areas" on disk to store pages that do not have a disk image.
Manage the space on swap areas allocating and freeing "page slots" as the need occurs.
Provide functions both to "swap out" pages from RAM into a swap area and to "swap in" pages from a swap area into RAM.
Make use of "swapped-out page identifiers" in the Page Table entries of pages that are currently swapped out to keep track of the positions of data in the swap areas.
To sum up, swapping is the crowning feature of page frame reclaiming. If we want to be sure that all the page frames obtained by a process, and not only those containing pages that have an image on disk, can be reclaimed at will by the PFRA, then swapping has to be used. Of course, you might turn off swapping by using the swapoff command; in this case, however, disk thrashing is likely to occur sooner when the system load increases.
We should also mention that swapping can be used to expand the memory address space that is effectively usable by the User Mode processes. In fact, large swap areas allow the kernel to launch several demanding applications whose total memory requests exceed the amount of physical RAM installed in the system. However, simulation of RAM is not like RAM in terms of performance. Every access by a process to a page that is currently swapped out is several orders of magnitude slower than an access to a page in RAM. In short, if performance is of great importance, swapping should be used only as a last resort; adding RAM chips still remains the best solution to cope with increasing computing needs.
The pages swapped out from memory are stored in a swap
area, which may be implemented either as a disk partition
of its own or as a file included in a larger partition. Several
different swap areas may be defined, up to a maximum number specified by the
MAX_SWAPFILES macro (usually set to
32).
Having multiple swap areas allows a system administrator to spread a lot of swap space among several disks so that the hardware can act on them concurrently; it also lets swap space be increased at runtime without rebooting the system.
Each swap area consists of a sequence of page
slots : 4,096-byte blocks used to contain a swapped-out page.
The first page slot of a swap area is used to persistently store some
information about the swap area; its format is described by the
swap_header union composed of two
structures, info and magic. The magic structure provides a string that marks
part of the disk unambiguously as a swap area; it consists of just one
field, magic.magic, which contains
a 10-character "magic" string. The magic structure essentially allows the
kernel to unambiguously identify a file or a partition as a swap area;
the text of the string, namely "SWAPSPACE2," is always located at the
end of the first page slot.
The info structure includes
the following fields:
bootbits
Not used by the swapping algorithm; this field corresponds to the first 1,024 bytes of the swap area, which may store partition data, disk labels, and so on.
version
Swapping algorithm version.
last_page
Last page slot that is effectively usable.
nr_badpages
Number of defective page slots.
padding[125]
Padding bytes.
badpages[1]
Up to 637 numbers specifying the location of defective page slots.
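The layout of the first page slot can be written down as a C union. This sketch follows the field list just given, with PAGE_SIZE assumed to be 4,096 bytes; the info view holds the control data, while the magic view places the 10-character string at the very end of the slot.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* First page slot of a swap area, after the swap_header union described
 * in the text: the "SWAPSPACE2" string ends the 4,096-byte slot. */
union swap_header {
    struct {
        char reserved[PAGE_SIZE - 10];
        char magic[10];               /* "SWAPSPACE2" */
    } magic;
    struct {
        char         bootbits[1024];  /* partition data, disk labels, ... */
        unsigned int version;
        unsigned int last_page;
        unsigned int nr_badpages;
        unsigned int padding[125];
        unsigned int badpages[1];
    } info;
};
```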
The data stored in a swap area is meaningful as long as the system is on. When the system is switched off, all processes are killed, so the data stored by processes in swap areas is discarded. For this reason, swap areas contain very little control information: essentially, the swap area type and the list of defective page slots. This control information easily fits in a single 4 KB page.
Usually, the system administrator creates a swap partition when creating the other partitions on the Linux system, and then uses the mkswap command to set up the disk area as a new swap area. That command initializes the fields just described within the first page slot. Because the disk may include some bad blocks, the program also examines all other page slots to locate the defective ones. But executing the mkswap command leaves the swap area in an inactive state. Each swap area can be activated in a script file at system boot or dynamically after the system is running.
Each swap area consists of one or more swap
extents , each of which is represented by a swap_extent descriptor. Each extent
corresponds to a group of pages—or more accurately, page slots—that
are physically adjacent on disk. Hence, the swap_extent descriptor includes the index
of the first page of the extent in the swap area, the length in
pages of the extent, and the starting disk sector number of the
extent. An ordered list of the extents that compose a swap area is
created when activating the swap area itself. A swap area stored in
a disk partition is composed of just one extent; conversely, a swap
area stored in a regular file can be composed of several extents,
because the filesystem may not have allocated the whole file in
contiguous blocks on disk.
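The page-slot-to-sector mapping performed through the extent list can be sketched as follows. The struct and field names are illustrative (the real kernel walks a list starting from curr_swap_extent rather than an array), and one sector per page is assumed for simplicity.

```c
#include <assert.h>

/* Sketch of mapping a page slot index to a disk sector through a set of
 * swap extents, each covering physically adjacent slots on disk. */
struct swap_extent_sketch {
    unsigned long start_page;   /* first page of the extent in the area */
    unsigned long nr_pages;     /* length of the extent in pages */
    unsigned long start_block;  /* starting disk sector of the extent */
};

static long slot_to_sector(const struct swap_extent_sketch *ext, int n,
                           unsigned long page)
{
    for (int i = 0; i < n; i++)
        if (page >= ext[i].start_page &&
            page <  ext[i].start_page + ext[i].nr_pages)
            return (long)(ext[i].start_block + (page - ext[i].start_page));
    return -1;                  /* slot outside every extent */
}
```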
When swapping out, the kernel tries to store pages in contiguous page slots to minimize disk seek time when accessing the swap area; this is an important element of an efficient swapping algorithm.
However, if more than one swap area is used, things become more complicated. Faster swap areas—swap areas stored in faster disks—get a higher priority. When looking for a free slot, the search starts in the swap area that has the highest priority. If there are several of them, swap areas of the same priority are cyclically selected to avoid overloading one of them. If no free slot is found in the swap areas that have the highest priority, the search continues in the swap areas that have a priority next to the highest one, and so on.
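The selection policy can be sketched as a small search. This is only an illustration of the priority-then-round-robin rule described above, not the kernel's actual slot-allocation code.

```c
#include <assert.h>

/* Sketch of swap area selection: among the areas that still have free
 * slots, prefer the highest priority; among equal-priority candidates,
 * take the first one after the previously used area, so same-priority
 * areas are used in round-robin fashion. */
static int pick_swap_area(const int *prio, const int *free_slots,
                          int n, int last_used)
{
    int best = -1, best_prio = 0;
    /* find the highest priority among areas with free slots */
    for (int i = 0; i < n; i++)
        if (free_slots[i] > 0 && (best == -1 || prio[i] > best_prio)) {
            best = i;
            best_prio = prio[i];
        }
    if (best == -1)
        return -1;              /* every swap area is full */
    /* round-robin among the areas sharing that priority */
    for (int k = 1; k <= n; k++) {
        int i = (last_used + k) % n;
        if (free_slots[i] > 0 && prio[i] == best_prio)
            return i;
    }
    return best;
}
```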
Each active swap area has its own swap_info_struct descriptor in memory. The
fields of the descriptor are illustrated in Table 17-3.
Table 17-3. Fields of a swap area descriptor
| Type | Field | Description |
|---|---|---|
| unsigned int | flags | Swap area flags |
| spinlock_t | sdev_lock | Spin lock protecting the swap area |
| struct file * | swap_file | Pointer to the file object of the regular file or device file that stores the swap area |
| struct block_device * | bdev | Descriptor of the block device containing the swap area |
| struct list_head | extent_list | Head of the list of extents that compose the swap area |
| int | nr_extents | Number of extents composing the swap area |
| struct swap_extent * | curr_swap_extent | Pointer to the most recently used extent descriptor |
| unsigned | old_block_size | Natural block size of the partition containing the swap area |
| unsigned short * | swap_map | Pointer to an array of counters, one for each swap area page slot |
| unsigned int | lowest_bit | First page slot to be scanned when searching for a free one |
| unsigned int | highest_bit | Last page slot to be scanned when searching for a free one |
| unsigned int | cluster_next | Next page slot to be scanned when searching for a free one |
| unsigned int | cluster_nr | Number of free page slot allocations before restarting from the beginning |
| int | prio | Swap area priority |
| int | pages | Number of usable page slots |
| unsigned long | max | Size of swap area in pages |
| unsigned long | inuse_pages | Number of used page slots in the swap area |
| int | next | Pointer to next swap area descriptor |
The flags field includes
three overlapping subfields:
SWP_USED
1 if the swap area is active; 0 if it is inactive.
SWP_WRITEOK
1 if it is possible to write into the swap area; 0 if the swap area is read-only (it is being activated or inactivated).
SWP_ACTIVE
This 2-bit field is actually the combination of SWP_USED and SWP_WRITEOK; the flag is set when both the previous flags are set.
The swap_map field points to
an array of counters, one for each swap area page slot. If the counter
is equal to 0, the page slot is free; if it is positive, the page slot
is filled with a swapped-out page. Essentially, the page slot counter
denotes the number of processes that share the swapped-out page. If
the counter has the value SWAP_MAP_MAX (equal to 32,767), the page
stored in the page slot is "permanent" and cannot be removed from the
corresponding slot. If the counter has the value SWAP_MAP_BAD (equal to 32,768), the page
slot is considered defective, and thus unusable.[*]
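The swap_map counter semantics can be summarized with a few predicates; the constants are the ones quoted in the text.

```c
#include <assert.h>

/* Semantics of a swap_map counter: 0 means free, a positive value
 * counts the processes sharing the swapped-out page, and the two
 * special values below mark permanent and defective slots. */
#define SWAP_MAP_MAX 32767
#define SWAP_MAP_BAD 32768

static int slot_is_free(unsigned short count)      { return count == 0; }
static int slot_is_bad(unsigned short count)       { return count == SWAP_MAP_BAD; }
static int slot_is_permanent(unsigned short count) { return count == SWAP_MAP_MAX; }
```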
The prio field is a signed
integer that denotes the order in which the swap subsystem should
consider each swap area.
The sdev_lock field is a spin
lock that protects the swap area's data structures—chiefly, the swap
descriptor—against concurrent accesses in SMP systems.
The swap_info array includes
MAX_SWAPFILES swap area
descriptors. Only the areas whose SWP_USED flags are set are used, because
they are the activated areas. Figure 17-6 illustrates the
swap_info array, one swap area, and
the corresponding array of counters.
The nr_swapfiles variable
stores the index of the last array element that contains, or that has
contained, a used swap area descriptor. Despite its name, the variable
does not contain the number of active swap
areas.
Descriptors of active swap areas are also inserted into a list
sorted by the swap area priority. The list is implemented through the
next field of the swap area
descriptor, which stores the index of the next descriptor in the
swap_info array. This use of the
field as an index is different from most fields with the name next, which are usually pointers.
The swap_list variable, of
type swap_list_t, includes the
following fields:
head
Index in the swap_info array of the first list element.
next
Index in the swap_info array of the descriptor of the next swap area to be selected for swapping out pages. This field is used to implement a Round Robin algorithm among maximum-priority swap areas with free slots.
The swaplock spin lock
protects the list against concurrent accesses in multiprocessor
systems.
The max field of the swap
area descriptor stores the size of the swap area in pages, while the
pages field stores the number of
usable page slots. These numbers differ because pages does not take the first page slot and
the defective page slots into consideration.
Finally, the nr_swap_pages
variable contains the number of available (free and nondefective) page
slots in all active swap areas, while the total_swap_pages variable contains the total
number of nondefective page slots.
A swapped-out page is uniquely identified quite simply
by specifying the index of the swap area in the swap_info array and the page slot index
inside the swap area. Because the first page (with index 0) of the
swap area is reserved for the swap_header union discussed earlier, the
first useful page slot has index 1. The format of a
swapped-out page identifier is illustrated in
Figure 17-7.
The swp_entry(type,offset)
function constructs a swapped-out page identifier from the swap area
index type and the page slot index
offset. Conversely, the swp_type and swp_offset functions extract from a
swapped-out page identifier the swap area index and the page slot
index, respectively.
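A minimal sketch of this packing, assuming the layout implied by the text for the 80 x 86 (bit 0 is the Present flag, kept clear; the page slot index occupies the upper 24 bits; the swap area index sits in between). The exact bit positions are an illustrative assumption, not the kernel's definitive encoding:

```c
#include <assert.h>

typedef unsigned long swp_entry_sketch;

/* Illustrative layout only: bit 0 = Present flag (always 0 for a
 * swapped-out page), bits 1-7 = swap area index ("type"),
 * bits 8-31 = 24-bit page slot index ("offset"). */
static swp_entry_sketch swp_entry_sketch_make(unsigned type,
                                              unsigned long offset)
{
    return (offset << 8) | ((unsigned long)type << 1);
}

static unsigned swp_type_sketch(swp_entry_sketch e)
{
    return (e >> 1) & 0x7f;     /* recover the swap area index */
}

static unsigned long swp_offset_sketch(swp_entry_sketch e)
{
    return e >> 8;              /* recover the page slot index */
}
```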
When a page is swapped out, its identifier is inserted as the
page's entry into the Page Table so the page can be found again when
needed. Notice that the least-significant bit of such an identifier,
which corresponds to the Present
flag, is always cleared to denote the fact that the page is not
currently in RAM. However, at least one of the remaining 31 bits has
to be set because no page is ever stored in slot 0 of swap area 0. It
is therefore possible to identify three different cases from the value
of a Page Table entry:
The page does not belong to the process address space, or the underlying page frame has not yet been assigned to the process (demand paging).
The page is currently swapped out.
The page is contained in RAM.
The maximum size of a swap area is determined by the number of bits available to identify a slot. On the 80 x 86 architecture, the 24 bits available limit the size of a swap area to 2^24 slots (that is, to 64 GB).
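The three cases can be sketched as a small classifier over a Page Table entry value, assuming as above that bit 0 stands for the Present flag; this is an illustration of the rule, not the kernel's actual pte helpers:

```c
#include <assert.h>

enum pte_state { PTE_NONE, PTE_SWAPPED, PTE_PRESENT };

/* Classify a Page Table entry value following the three cases in the
 * text: null entry (page not mapped, or frame not yet assigned),
 * swapped-out identifier (Present clear but other bits set), or page
 * present in RAM (Present bit set). */
static enum pte_state classify_pte(unsigned long pte)
{
    if (pte & 1)
        return PTE_PRESENT;     /* page is in RAM */
    if (pte == 0)
        return PTE_NONE;        /* demand paging / not mapped */
    return PTE_SWAPPED;         /* swapped-out page identifier */
}
```

The middle case works precisely because slot 0 of swap area 0 is never used, so a swapped-out identifier is always nonzero even with the Present bit clear.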
Because a page may belong to the address spaces of several processes (see the earlier section "Reverse Mapping"), it may
be swapped out from the address space of one process and still remain
in main memory; therefore, it is possible to swap out the same page
several times. A page is physically swapped out and stored just once,
of course, but each subsequent attempt to swap it out increases the
swap_map counter.
The swap_duplicate( )
function is usually invoked while trying to swap out an already
swapped-out page. It simply verifies that the swapped-out page
identifier passed as its parameter is valid and increases the
corresponding swap_map counter.
More precisely, it performs the following actions:
Uses the swp_type and
swp_offset functions to extract
the swap area number and the page slot index from the
parameter.
Checks whether the swap area number identified is active; if not, it returns 0 (invalid identifier).
Checks whether the page slot is valid and not free (its
swap_map counter is greater
than 0 and less than SWAP_MAP_BAD); if not, it returns 0
(invalid identifier).
Otherwise, the swapped-out page identifier locates a valid
page. Increases the swap_map
counter of the page slot if it has not already reached the value
SWAP_MAP_MAX.
Returns 1 (valid identifier).
Once a swap area is initialized, the superuser (or, more
precisely, every user having the CAP_SYS_ADMIN capability, as described in
the section "Process
Credentials and Capabilities" in Chapter 20) may use the swapon and swapoff programs to activate and deactivate
the swap area, respectively. These programs use the swapon( ) and swapoff( ) system calls; we'll briefly
sketch out the corresponding service routines.
The sys_swapon( ) service
routine receives the following as its parameters:
specialfile
This parameter points to the pathname (in the User Mode address space) of the device file (partition) or plain file used to implement the swap area.
swap_flags
This parameter consists of a single SWAP_FLAG_PREFER bit plus 31 bits of
priority of the swap area (these bits are significant only if
the SWAP_FLAG_PREFER bit is
on).
The function checks the fields of the swap_header union that was put in the
first slot when the swap area was created. The function performs
these main steps:
Checks that the current process has the CAP_SYS_ADMIN capability.
Looks in the first nr_swapfiles components of the
swap_info array of swap area
descriptors for the first descriptor having the SWP_USED flag cleared, meaning that
the corresponding swap area is inactive. If an inactive swap
area is found, it goes to step 4.
The new swap area array index is equal to nr_swapfiles: it checks that the
number of bits reserved for the swap area index is sufficiently
large to encode the new index; if not, returns an error code;
otherwise, it increases by one the value of nr_swapfiles.
An index of an unused swap area has been found: it
initializes the descriptor's fields; in particular, it sets
flags to SWP_USED, and sets lowest_bit and highest_bit to 0.
If the swap_flags
parameter specifies a priority for the new swap area, the
function sets the prio field
of the descriptor. Otherwise, it initializes the field to one
less than the lowest priority among all active swap areas (thus
assuming that the last activated swap area is on the slowest
block device). If no other swap areas are already active, the
function assigns the value -1.
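The priority-defaulting rule in this step reduces to a small computation. The function name and input representation below are hypothetical, used only to state the rule:

```c
#include <assert.h>

/* Default-priority rule sketched from the sys_swapon() description:
 * with no requested priority, the new area gets one less than the
 * lowest priority among the n active areas, or -1 if none is active.
 * prios[] is an illustrative input array, not kernel data. */
static int default_swap_prio(const int *prios, int n)
{
    if (n == 0)
        return -1;
    int lowest = prios[0];
    for (int i = 1; i < n; i++)
        if (prios[i] < lowest)
            lowest = prios[i];
    return lowest - 1;
}
```

This ordering encodes the heuristic that later-activated areas sit on slower block devices, so they should be used last.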
Copies the string pointed to by the specialfile parameter from the User
Mode address space.
Invokes filp_open( ) to
open the file specified by the specialfile parameter (see the section
"The open( ) System
Call" in Chapter
12).
Stores the addresses of the file object returned by
filp_open( ) in the swap_file field of the swap area
descriptor.
Makes sure that the swap area is not already activated by
looking at the other active swap areas in swap_info. This is done by checking
the addresses of the address_space objects stored in the
swap_file->f_mapping field
of the swap area descriptors. If the swap area is already
active, it returns an error code.
If the specialfile
parameter identifies a block device file, it performs the
following substeps:
Invokes bd_claim( )
to set the swapping subsystem as the holder of the block
device (see the section "Block Devices"
in Chapter 14).
If the block device already has a holder, it returns an
error code.
Stores the address of the block_device descriptor in the
bdev field of the swap
area descriptor.
Stores the current block size of the device in the
old_block_size field of
the swap area descriptor, then sets the block size of the
device to 4,096 bytes (the page size).
If the specialfile
parameter identifies a regular file, it performs the following
substeps:
Checks the S_SWAPFILE flag in the i_flags field of the file's inode.
If this flag is set, it returns an error code because the
file is already being used as a swap area.
Stores the descriptor address of the block device
containing the file in the bdev field of the swap area
descriptor.
Reads the swap_header
descriptor stored in slot 0 of the swap area. To that end, it
invokes read_cache_page( )
passing as parameters the address_space object pointed to by
swap_file->f_mapping, the
page index 0, the address of the file's readpage method (stored in swap_file->f_mapping->a_ops->readpage),
and the pointer to the file object swap_file. Waits until the page has
been read into memory.
Checks that the magic string in the last 10 characters of the first page is equal to "SWAPSPACE2." If not, it returns an error code.
Initializes the lowest_bit and highest_bit fields of the swap area
descriptor according to the size of the swap area stored in the
info.last_page field of the
swap_header union.
Invokes vmalloc( ) to
create the array of counters associated with the new swap area
and stores its address in the swap_map field of the swap descriptor.
Initializes the elements of the array to 0 or to SWAP_MAP_BAD, according to the list of
defective page slots stored in the info.bad_pages field of the swap_header union.
Computes the number of useful page slots by accessing the
info.last_page and info.nr_badpages fields in the first
page slot, and stores it in the pages field of the swap area
descriptor. Also sets the max
field with the total number of pages in the swap area.
Builds the extent_list
list of swap extents for the new swap area (only one if the swap
area is a disk partition), and sets properly the nr_extents and curr_swap_extent fields in the swap
area descriptor.
Sets the flags field of
the swap area descriptor to SWP_ACTIVE.
Updates the nr_good_pages, nr_swap_pages, and total_swap_pages global
variables.
Inserts the swap area descriptor in the list to which the
swap_list variable
points.
Returns 0 (success).
The sys_swapoff( )
service routine deactivates a swap area identified by the parameter
specialfile. It is much more
complex and time-consuming than sys_swapon(
), since the partition to be deactivated might still
contain pages that belong to several processes. The function is thus
forced to scan the swap area and to swap in all existing pages.
Because each swap-in requires a new page frame, it might fail if
there are no free page frames left. In this case, the function
returns an error code. All this is achieved by performing the
following major steps:
Checks that the current process has the CAP_SYS_ADMIN capability.
Copies the string pointed to by the specialfile parameter in kernel
space.
Invokes filp_open( ) to
open the file referenced by the specialfile parameter; as usual, this
function returns the address of a file object.
Scans the swap_list
list of the swap area descriptor, and compares the address of
the file object returned by filp_open(
) with the addresses stored in the swap_file fields of the active swap
area descriptors. If no match is found, an invalid parameter was
passed to the function, so it returns an error code.
Invokes cap_vm_enough_memory(
) to check whether there are enough free page frames
to swap in all pages stored in the swap area. If not, the swap
area cannot be deactivated; it releases the file object and
returns an error code. This is only a rough check, but it could
save the kernel from a lot of useless disk activity. While
performing this check, cap_vm_enough_memory( ) takes into
account the page frames allocated through slab caches having the
SLAB_RECLAIM_ACCOUNT flag set
(see the section "Interfacing the Slab
Allocator with the Zoned Page Frame Allocator" in Chapter 8). The number of
such pages, which are considered as reclaimable, is stored in
the slab_reclaim_pages
variable.
Removes the swap area descriptor from the swap_list list.
Updates the nr_swap_pages and total_swap_pages variables by
subtracting the value in the pages field of the swap area
descriptor.
Clears the SWP_WRITEOK
flag in the flags field of
the swap area descriptor; this forbids the PFRA from swapping
out more pages in the swap area.
Invokes try_to_unuse( )
(see below) to successively force all pages left in the swap
area into RAM and to correspondingly update the Page Tables of
the processes that use these pages. While executing this
function, the current process, which is executing the swapoff command, has the PF_SWAPOFF flag set. Setting this flag
has just one consequence: in case of a dramatic shortage of page
frames, the select_bad_process(
) function will be forced to select and kill this
process! (See the section "The Out of Memory
Killer" earlier in this chapter.)
Waits until the block device driver that contains the swap
area is unplugged (see the section "Activating the Block
Device Driver" in Chapter 14). In this way,
the reading requests submitted by try_to_unuse( ) will be handled by the
driver before the swap area is deactivated.
If try_to_unuse( )
fails in allocating all requested page frames, the swap area
cannot be deactivated. Therefore, the function executes the
following substeps:
Reinserts the swap area descriptor in the swap_list list and sets its
flags field to SWP_WRITEOK.
Restores the original contents of the nr_swap_pages and total_swap_pages variables by
adding the value in the pages field of the swap area
descriptor.
Invokes filp_close(
) to close the file opened in step 3 (see the
section "The
close( ) System Call" in Chapter 12), and
returns an error code.
Otherwise, all used page slots have been successfully transferred to RAM. Therefore, the function executes the following substeps:
Releases the memory areas used to store the swap_map array and the extent
descriptors.
If the swap area is stored in a disk partition, it
restores the block size to its original value, which is
stored in the old_block_size field of the swap
area descriptor; moreover, it invokes the bd_release( ) function so that the
swap subsystem no longer holds the block device (see step
10a in the description of sys_swapon( )).
If the swap area is stored in a regular file, it
clears the S_SWAPFILE
flag of the file's inode.
Invokes filp_close(
) twice, the first time on the swap_file file object, the second
time on the object returned by filp_open( ) in step 3.
Returns 0 (success).
The try_to_unuse( )
function acts on an index parameter that identifies the swap area to
be emptied; it swaps in pages and updates all the Page Tables of
processes that have swapped out pages in this swap area. To that
end, the function visits the address spaces of all kernel
threads and processes, starting with the init_mm memory descriptor that is used as
a marker. It is a time-consuming function that runs mostly with the
interrupts enabled. Synchronization with other processes is
therefore critical.
The try_to_unuse( )
function scans the swap_map array
of the swap area. When the function finds an in-use page slot, it
first swaps in the page, and then starts looking for the processes
that reference the page. The ordering of these two operations is
crucial to avoid race conditions. While the I/O data transfer is
ongoing, the page is locked, so no process can access it. Once the
I/O data transfer completes, the page is locked again by try_to_unuse( ), so it cannot be swapped
out again by another kernel control path. Race conditions are also
avoided because each process looks up the page cache before starting
a swap-in or swap-out operation (see the later section "The Swap Cache").
Finally, the swap area considered by try_to_unuse( ) is marked as nonwritable
(SWP_WRITEOK flag is not set), so
no process can perform a swap-out on a page slot of this
area.
However, try_to_unuse( )
might be forced to scan the swap_map array of usage counters of the
swap area several times. This is because memory regions that contain references to swapped-out pages might
disappear during one scan and later reappear in the process
lists.
For instance, recall the description of the do_munmap( ) function (in the section
"Releasing a Linear
Address Interval" in Chapter 9): whenever a process
releases an interval of linear addresses, do_munmap( ) removes from the process list
all memory regions that include the affected linear addresses;
later, the function reinserts the memory regions that have been only
partially unmapped in the process list. do_munmap( ) takes care of freeing the
swapped-out pages that belong to the interval of released linear
addresses. It commendably doesn't free the swapped-out pages that
belong to the memory regions that have to be reinserted in the
process list.
Hence, try_to_unuse( )
might fail in finding a process that references a given page slot
because the corresponding memory region is temporarily not included
in the process list. To cope with this fact, try_to_unuse( ) keeps scanning the
swap_map array until all
reference counters are null. Eventually, the ghost memory regions
referencing the swapped-out pages will reappear in the process
lists, so try_to_unuse( ) will
succeed in freeing all page slots.
Let's describe now the major operations executed by try_to_unuse( ). It executes a continuous
loop on the reference counters in the swap_map array of the swap area passed as
its parameter. This loop is interrupted and the function returns an
error code if the current process receives a signal. For each
reference counter, the function performs the following steps:
If the counter is equal to 0 (no page is stored there) or
to SWAP_MAP_BAD, it continues
with the next page slot.
Otherwise, it invokes the read_swap_cache_async( ) function (see
the section "Swapping in Pages"
later in this chapter) to swap in the page. This consists of
allocating, if necessary, a new page frame, filling it with the
data stored in the page slot, and putting the page in the swap
cache.
Waits until the new page has been properly updated from disk and locks it.
While the function was executing the previous step, the process could have been suspended. Therefore, it checks again whether the reference counter of the page slot is null; if so, this swap page has been freed by another kernel control path, so the function continues with the next page slot.
Invokes unuse_process(
) on every memory descriptor in the doubly linked list
whose head is init_mm (see
the section "The
Memory Descriptor" in Chapter 9). This
time-consuming function scans all Page Table entries of the
process that owns the memory descriptor, and replaces each
occurrence of the swapped-out page identifier with the physical
address of the page frame. To reflect this move, the function
also decreases the page slot counter in the swap_map array (unless it is equal to
SWAP_MAP_MAX) and increases
the usage counter of the page frame.
Invokes shmem_unuse( )
to check whether the swapped-out page is used as an IPC shared
memory resource and to properly handle that case (see the
section "IPC Shared
Memory" in Chapter
19).
Checks the value of the reference counter of the page. If
it is equal to SWAP_MAP_MAX,
the page slot is "permanent." To free it, it forces the value 1
into the reference counter.
The swap cache might own the page as well (it contributes
to the value of the reference counter). If the page belongs to
the swap cache, it invokes the swap_writepage( ) function to flush
its contents to disk (if the page is dirty) and invokes delete_from_swap_cache( ) to remove
the page from the swap cache and to decrease its reference
counter.
Sets the PG_dirty flag
of the page descriptor, unlocks the page frame, and decreases
its reference counter (to undo the increment done in step
5).
Checks the need_resched
field of the current process; if it is set, it invokes schedule( ) to relinquish the CPU.
Deactivating a swap area is a long job, and the kernel must
ensure that the other processes in the system still continue to
execute. The try_to_unuse( )
function continues from this step whenever the process is
selected again by the scheduler.
Proceeds with the next page slot, starting at step 1.
The function continues until every reference counter in the
swap_map array is null. Recall
that even if the function starts examining the next page slot, the
reference counter of the previous page slot could still be positive.
In fact, a "ghost" process could still reference the page, typically
because some memory regions have been temporarily removed from the
process list scanned in step 5. Eventually, try_to_unuse( ) catches every reference.
In the meantime, however, the page is no longer in the swap cache,
it is unlocked, and a copy is still included in the page slot of the
swap area being deactivated.
One might expect that this situation could lead to data loss. For instance, suppose that some "ghost" process accesses the page slot and starts swapping the page in. Because the page is no longer in the swap cache, the process fills a new page frame with the data read from disk. However, this page frame would be different from the page frames owned by the processes that are supposed to share the page with the "ghost" process.
This problem does not arise when deactivating a swap area,
because interference from a ghost process could happen only if a
swapped-out page belongs to a private anonymous memory
mapping.[*] In this case, the page frame is handled by means of
the Copy On Write mechanism described in Chapter 9, so it is perfectly
legal to assign different page frames to the processes that
reference the page. However, the try_to_unuse( ) function marks the page as
"dirty" (step 9); otherwise, the shrink_list( ) function might later drop
the page from the Page Table of some process without saving it in
another swap area (see the later section "Swapping Out
Pages").
As we will see later, when freeing memory, the kernel swaps out many pages in a short period of time. It is therefore important to try to store these pages in contiguous slots to minimize disk seek time when accessing the swap area.
A first approach to an algorithm that searches for a free slot could choose one of two simplistic, rather extreme strategies:
Always start from the beginning of the swap area. This approach may increase the average seek time during swap-out operations, because free page slots may be scattered far away from one another.
Always start from the last allocated page slot. This approach increases the average seek time during swap-in operations if the swap area is mostly free (as is usually the case), because the handful of occupied page slots may be scattered far away from one another.
Linux adopts a hybrid approach. It always starts from the last allocated page slot unless one of these conditions occurs:
The end of the swap area is reached.
SWAPFILE_CLUSTER (usually
256) free page slots were allocated after the last restart from
the beginning of the swap area.
The cluster_nr field in the
swap_info_struct descriptor stores
the number of free page slots allocated. This field is reset to 0 when
the function restarts allocation from the beginning of the swap area.
The cluster_next field stores the
index of the first page slot to be examined in the next
allocation.[*]
To speed up the search for free page slots, the kernel keeps the
lowest_bit and highest_bit fields of each swap area
descriptor up-to-date. These fields specify the first and the last
page slots that could be free; in other words, every page slot below
lowest_bit and above highest_bit is known to be occupied.
The scan_swap_map( )
function is used to find a free page slot in a given swap area. It
acts on a single parameter, which points to a swap area descriptor
and returns the index of a free page slot. It returns 0 if the swap
area does not contain any free slots. The function performs the
following steps:
It tries first to use the current cluster. If the cluster_nr field of the swap area
descriptor is positive, it scans the swap_map array of counters starting
from the element at index cluster_next and looks for a null
entry. If a null entry is found, it decreases the cluster_nr field and goes to step
4.
If this point is reached, either the cluster_nr field is null or the search
starting from cluster_next
didn't find a null entry in the swap_map array. It is time to try the
second stage of the hybrid search. The function reinitializes
cluster_nr to SWAPFILE_CLUSTER and restarts scanning
the array from the lowest_bit
index trying to find a group of SWAPFILE_CLUSTER free page slots. If
such a group is found, it goes to step 4.
No group of SWAPFILE_CLUSTER free page slots
exists. The function restarts scanning the array from the
lowest_bit index trying to
find a single free page slot. If no null entry is found, it sets
the lowest_bit field to the
maximum index in the array, the highest_bit field to 0, and returns 0
(the swap area is full).
A null entry is found. Puts the value 1 in the entry,
decreases nr_swap_pages,
updates the lowest_bit and
highest_bit fields if
necessary, increases the inuse_pages field by one, and sets the
cluster_next field to the
index of the page slot just allocated plus 1.
Returns the index of the allocated page slot.
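The hybrid search can be sketched as follows. This is a deliberately simplified model of one swap area: it keeps only the swap_map counters and the cluster bookkeeping, collapses the cluster-group search of step 2 into a single rescan, and omits the lowest_bit/highest_bit and inuse_pages updates; the names and sizes are assumptions for illustration:

```c
#include <assert.h>

#define SWAPFILE_CLUSTER 256
#define AREA_SLOTS 4096

/* One swap area: slot 0 is never allocated (it holds the header). */
static unsigned short slot_map[AREA_SLOTS];  /* 0 = free slot */
static int cluster_next = 1;   /* first slot to examine next time */
static int cluster_nr = 0;     /* free slots left in current cluster */

/* Continue from the last allocation while the current cluster lasts,
 * otherwise restart from the beginning of the area. Returns a slot
 * index, or 0 if the area is full. */
static int scan_swap_map_sketch(void)
{
    if (cluster_nr > 0) {
        for (int i = cluster_next; i < AREA_SLOTS; i++)
            if (slot_map[i] == 0) {
                cluster_nr--;
                slot_map[i] = 1;
                cluster_next = i + 1;
                return i;
            }
    }
    cluster_nr = SWAPFILE_CLUSTER;      /* restart from the beginning */
    for (int i = 1; i < AREA_SLOTS; i++)
        if (slot_map[i] == 0) {
            slot_map[i] = 1;
            cluster_next = i + 1;
            return i;
        }
    return 0;                           /* swap area is full */
}
```

Even in this reduced form, consecutive allocations land in adjacent slots, which is the point of the hybrid policy: clustered slots minimize disk seeks during swap-out.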
The get_swap_page(
) function is used to find a free page slot by searching
all the active swap areas. The function, which returns the
swapped-out page identifier of a newly allocated page slot or 0 if
all swap areas are filled, takes into consideration the different
priorities of the active swap areas.
Two passes are done in order to minimize runtime when it's easy to find a page slot. The first pass is partial and applies only to areas that have a single priority; the function searches such areas in a Round Robin fashion for a free slot. If no free page slot is found, a second pass is made starting from the beginning of the swap area list; during this second pass, all swap areas are examined. More precisely, the function performs the following steps:
If nr_swap_pages is
null or if there are no active swap areas, it returns 0.
Starts by considering the swap area pointed to by swap_list.next (recall that the swap
area list is sorted by decreasing priorities).
If the swap area is active, it invokes scan_swap_map( ) to allocate a free
page slot. If scan_swap_map(
) returns a page slot index, the function's job is
essentially done, but it must prepare for its next invocation.
Thus, it updates swap_list.next to point to the next
swap area in the swap area list, if the latter has the same
priority (thus continuing the round-robin use of these swap
areas). If the next swap area does not have the same priority as
the current one, the function sets swap_list.next to the first swap area
in the list (so that the next search will start with the swap
areas that have the highest priority). The function finishes by
returning the swapped-out page identifier corresponding to the
page slot just allocated.
Either the swap area is not writable, or it does not have free page slots. If the next swap area in the swap area list has the same priority as the current one, the function makes it the current one and goes to step 3.
At this point, the next swap area in the swap area list has a lower priority than the previous one. The next step depends on which of the two passes the function is performing.
If this is the first (partial) pass, it considers the first swap area in the list and goes to step 3, thus starting the second pass.
Otherwise, it checks if there is a next element in the list; if so, it considers it and goes to step 3.
At this point the list is completely scanned by the second pass and no free page slot has been found; it returns 0.
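The area-selection policy described above (round robin within equal priorities, then a full pass) can be modeled with a toy area list. This is a sketch under simplifying assumptions — the list, fields, and `toy_` names are illustrative, and a successful `scan_swap_map( )` is stood in for by a free-slot counter.

```c
#include <assert.h>

#define NAREAS 3

/* A toy swap-area list, sorted by decreasing priority as in the kernel. */
struct toy_area {
    int prio;
    int free_slots;   /* stands in for a successful scan_swap_map() */
    int active;
};

static struct toy_area areas[NAREAS];
static int swap_list_next;   /* plays the role of swap_list.next */

/* Sketch of get_swap_page()'s selection: first a partial pass starting
 * at swap_list_next, then a full pass over the whole list.
 * Returns the chosen area index, or -1 if every area is full. */
static int toy_get_swap_area(void)
{
    int pass, i, start = swap_list_next;

    for (pass = 1; pass <= 2; pass++) {
        for (i = (pass == 1) ? start : 0; i < NAREAS; i++) {
            if (!areas[i].active || areas[i].free_slots == 0)
                continue;
            areas[i].free_slots--;
            /* Continue the round robin within the same priority,
             * otherwise restart from the highest-priority area. */
            if (i + 1 < NAREAS && areas[i + 1].prio == areas[i].prio)
                swap_list_next = i + 1;
            else
                swap_list_next = 0;
            return i;
        }
        /* First (partial) pass failed: rescan the whole list. */
    }
    return -1;
}
```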
The swap_free( )
function is invoked when swapping in a page to decrease the
corresponding swap_map counter
(see Table 17-3).
When the counter reaches 0, the page slot becomes free since its
identifier is no longer included in any Page Table entry. We'll see
in the later section "The Swap Cache,"
however, that the swap cache counts as an owner of the page slot.
The function acts on a single entry parameter that specifies a
swapped-out page identifier and performs the following steps:
Derives the swap area index and the offset page slot index from the
entry parameter and gets the
address of the swap area descriptor.
Checks whether the swap area is active and returns right away if it is not.
If the swap_map counter
corresponding to the page slot being freed is smaller than
SWAP_MAP_MAX, the function
decreases it. Recall that entries that have the SWAP_MAP_MAX value are considered
persistent (undeletable).
If the swap_map counter
becomes 0, the function increases the value of nr_swap_pages, decreases the inuse_pages field, and updates, if
necessary, the lowest_bit and
highest_bit fields of the
swap area descriptor.
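The counter handling in the last two steps can be condensed into a few lines of C. This is a sketch, not the kernel function: the `SWAP_MAP_MAX` value and the `toy_` globals are illustrative, and the `lowest_bit`/`highest_bit` refresh is elided.

```c
#include <assert.h>

#define SWAP_MAP_MAX 0x7fff   /* "permanent" slots are never decremented */

/* Toy counters standing in for the fields updated by swap_free(). */
static unsigned short swap_map[8];
static int nr_swap_pages, inuse_pages;

/* Drop one reference to a page slot; when the last reference goes away,
 * hand the slot back to the swap area. */
static void toy_swap_free(int offset)
{
    if (swap_map[offset] >= SWAP_MAP_MAX)
        return;                       /* persistent (undeletable) entry */
    if (swap_map[offset] > 0 && --swap_map[offset] == 0) {
        nr_swap_pages++;              /* one more free slot system-wide */
        inuse_pages--;                /* one fewer used slot in this area */
        /* a real implementation would also refresh lowest_bit/highest_bit */
    }
}
```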
Transferring pages to and from a swap area is an activity that can induce many race conditions. In particular, the swapping subsystem must handle carefully the following cases:
Two processes may concurrently try to swap in the same shared anonymous page.
A process may swap-in a page that is being swapped out by the PFRA.
The swap cache has been introduced to solve
these kinds of synchronization problems. The key rule is that nobody
can start a swap-in or swap-out without checking whether the swap
cache already includes the affected page. Thanks to the swap cache,
concurrent swap operations affecting the same page always act on the
same page frame; therefore, the kernel may safely rely on the PG_locked flag of the page descriptor to
avoid any race condition.
For example, consider two processes that share the same
swapped-out page. When the first process tries to access the page, the
kernel starts the swap-in operation. The very first step consists of
checking whether the page frame is already included in the swap cache.
Let's suppose it isn't: then, the kernel allocates a new page frame
and inserts it into the swap cache; next, it starts the I/O operation
to read the page's contents from the swap area. Meanwhile, the second
process accesses the shared anonymous page. As above, the kernel
starts a swap-in operation and checks whether the affected page frame
is already included in the swap cache. Now, it is included, thus the
kernel simply accesses the page frame descriptor and puts the current
process to sleep until the PG_locked flag is cleared, that is, until
the I/O data transfer completes.
The swap cache plays a crucial role also when concurrent swap-in
and swap-out operations mix up. As explained in the section "Low On Memory
Reclaiming" earlier in this chapter, the shrink_list( ) function starts swapping out
an anonymous page only if try_to_unmap(
) succeeds in removing the page frame from the User Mode
Page Tables of all processes that own the page. However, one of these
processes may access the page and cause a swap-in while the swap-out
write operation is still in progress.
Before being written to disk, each page to be swapped out is
stored in the swap cache by shrink_list(
). Consider a page P that is shared among two processes, A
and B. Initially, the Page Table entries of both processes contain a
reference to the page frame, and the page has two owners; this case is
illustrated in Figure
17-8(a). When the PFRA selects the page for reclaiming,
shrink_list( ) inserts the page
frame in the swap cache. As illustrated in Figure 17-8(b), now the
page frame has three owners, while the page slot in the swap area is
referenced only by the swap cache. Next, the PFRA invokes try_to_unmap( ) to remove the references to
the page frame from the Page Table of the processes; once this
function terminates, the page frame is referenced only by the swap
cache, while the page slot is referenced by the two processes and the
swap cache, as illustrated in Figure 17-8(c). Let's
suppose that, while the page's contents are being written to disk,
process B accesses the page—that is, it tries to access a memory cell
using a linear address inside the page. Then, the page fault handler
finds the page frame in the swap cache and puts back its physical
address in the Page Table entry of process B, as illustrated in Figure 17-8(d). Conversely,
if the swap-out operation terminates without concurrent swap-in
operations, the shrink_list( )
function removes the page frame from the swap cache and releases the
page frame to the Buddy system, as illustrated in Figure 17-8(e).
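The ownership transitions of the Figure 17-8 walkthrough can be checked with simple reference-count bookkeeping. The struct and step functions below are an illustrative model of the figure, not kernel code.

```c
#include <assert.h>

struct owners {
    int frame_refs;   /* users of the in-RAM page frame */
    int slot_refs;    /* users of the on-disk page slot (swap_map counter) */
};

/* (a) -> (b): shrink_list() puts the frame in the swap cache; the cache
 * becomes an extra owner of the frame and the only owner of the slot. */
static void cache_insert(struct owners *o) { o->frame_refs++; o->slot_refs++; }

/* (b) -> (c): try_to_unmap() replaces one process's PTE with the
 * swapped-out page identifier: the process trades a frame reference
 * for a slot reference. */
static void unmap_one(struct owners *o)    { o->frame_refs--; o->slot_refs++; }

/* (c) -> (d): a page fault during write-out finds the frame in the swap
 * cache and maps it back into the faulting process. */
static void swap_in_one(struct owners *o)  { o->frame_refs++; o->slot_refs--; }
```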
You might consider the swap cache as a transit area containing the page descriptors of anonymous pages that are being currently swapped-in or swapped out. When the swap-in or swap-out terminates (in the case of shared anonymous pages, the swap-in or swap-out must have been performed on all the processes that share the page), the page descriptor of the anonymous page may be removed from the swap cache.[*]
The swap cache is implemented by the page cache data
structures and procedures, which are described in the section "The Page Cache" in Chapter 15. Recall that the
core of the page cache is a set of radix trees that allows the
algorithm to quickly derive the address of a page descriptor from
the address of an address_space
object identifying the owner of the page as well as from an offset
value.
Pages in the swap cache are stored as every other page in the page cache, with the following special treatment:
The mapping field of
the page descriptor is set to NULL.
The PG_swapcache flag
of the page descriptor is set.
The private field
stores the swapped-out page identifier associated with the
page.
Moreover, when the page is put in the swap cache, both the
count field of the page
descriptor and the page slot usage counters are increased, because
the swap cache uses both the page frame and the page slot.
Finally, a single swapper_space address space is used for
all pages in the swap cache, so a single radix tree pointed to by
swapper_space.page_tree addresses
the pages in the swap cache. The nrpages field of the swapper_space address space stores the
number of pages contained in the swap cache.
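Because all swap-cache pages live in the single `swapper_space` address space, the radix-tree key is simply the swapped-out page identifier, which packs the swap area index and the slot offset into one word. The encoding below is a sketch — the bit widths and helper names are illustrative, not the kernel's exact layout.

```c
#include <assert.h>

/* Illustrative layout: low bits = swap area index, high bits = slot
 * offset. The real kernel layout differs, but the idea is the same. */
#define AREA_BITS 6

static unsigned long swp_entry(unsigned long area, unsigned long offset)
{
    return (offset << AREA_BITS) | area;   /* build an identifier */
}

static unsigned long swp_area(unsigned long entry)
{
    return entry & ((1UL << AREA_BITS) - 1);
}

static unsigned long swp_offset(unsigned long entry)
{
    return entry >> AREA_BITS;
}
```

A lookup in the swap cache then amounts to a radix-tree search in `swapper_space.page_tree` keyed by such an identifier.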
The kernel uses several functions to handle the swap cache; they are based mainly on those discussed in the section "The Page Cache" in Chapter 15. We show later how these relatively low-level functions are invoked by higher-level functions to swap pages in and out as needed.
The main functions that handle the swap cache are:
lookup_swap_cache( )
Finds a page in the swap cache through its swapped-out
page identifier passed as a parameter and returns the page
descriptor address. It returns 0 if the page is not present in
the cache. To find the required page, it invokes radix_tree_lookup( ), passing as
parameters a pointer to swapper_space.page_tree—the radix
tree used for pages in the swap cache—and the swapped-out page
identifier.
add_to_swap_cache( )
Inserts a page into the swap cache. It essentially
invokes swap_duplicate( )
to check whether the page slot passed as a parameter is valid
and to increase the page slot usage counter; then, it invokes
radix_tree_insert( ) to
insert the page into the cache; finally, it increases the
page's reference counter and sets the PG_swapcache and PG_locked flags.
_ _add_to_swap_cache( )
Similar to add_to_swap_cache(
), except that the function does not invoke swap_duplicate( ) before inserting
the page frame in the swap cache.
delete_from_swap_cache( )
Removes a page from the swap cache by invoking radix_tree_delete( ), decreases the
corresponding usage counter in swap_map, and decreases the page
reference counter.
free_page_and_swap_cache( )
Removes a page from the swap cache if no User Mode
process besides current is
referencing the corresponding page slot, and decreases the
page's usage counter.
free_pages_and_swap_cache( )
Analogous to free_page_and_swap_cache( ), but
operates on a set of pages.
free_swap_and_cache( )
Frees a swap entry, and checks whether the page
referenced by the entry is in the swap cache. If either no
User Mode process, besides current, is referencing the page or
more than 50% of the swap entries are busy, the function
removes the page from the swap cache.
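The bookkeeping shared by these helpers — charge the page slot, take a frame reference, file the page under its identifier — can be modeled with a toy cache. The direct-mapped table below stands in for the radix tree, and all names are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 16

struct toy_page { unsigned long entry; int refcount; int in_cache; };

/* A direct-mapped table standing in for swapper_space.page_tree. */
static struct toy_page *cache[CACHE_SIZE];

static struct toy_page *toy_lookup_swap_cache(unsigned long entry)
{
    struct toy_page *p = cache[entry % CACHE_SIZE];
    return (p && p->entry == entry) ? p : NULL;
}

/* Mirrors add_to_swap_cache(): charge the slot (swap_duplicate),
 * insert the page, and take a reference on the page frame. */
static int toy_add_to_swap_cache(struct toy_page *page, unsigned long entry,
                                 unsigned short *swap_map_counter)
{
    if (toy_lookup_swap_cache(entry))
        return -1;                 /* duplicate: caller must retry */
    (*swap_map_counter)++;         /* the cache now owns the slot */
    page->entry = entry;
    page->refcount++;              /* ...and the frame */
    page->in_cache = 1;
    cache[entry % CACHE_SIZE] = page;
    return 0;
}

/* Mirrors delete_from_swap_cache(): undo all three effects. */
static void toy_delete_from_swap_cache(struct toy_page *page,
                                       unsigned short *swap_map_counter)
{
    cache[page->entry % CACHE_SIZE] = NULL;
    (*swap_map_counter)--;
    page->refcount--;
    page->in_cache = 0;
}
```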
We have seen in the section "Low On Memory Reclaiming" earlier in this chapter how the PFRA determines whether a given anonymous page should be swapped out. In this section we show how the kernel performs a swap-out.
The first step of a swap-out operation consists of preparing
the swap cache. If the shrink_list(
) function determines that a page is anonymous (the
PageAnon( ) function returns 1)
and that the swap cache does not include the corresponding page
frame (the PG_swapcache flag in
the page descriptor is clear), the kernel invokes the add_to_swap( ) function.
The add_to_swap( ) function
allocates a new page slot in a swap area and inserts a page
frame—whose page descriptor address is passed as its parameter—in
the swap cache. Essentially, the function performs the following
steps:
Invokes get_swap_page(
) to allocate a new page slot; see the section "Allocating and Releasing
a Page Slot" earlier in this chapter. Returns 0 in case
of failure (for example, no free page slot found).
Invokes _ _add_to_page_cache(
), passing to it the page slot index, the page
descriptor address, and some allocation flags.
Sets the PG_uptodate
and PG_dirty flags in the
page descriptor, so that the shrink_list( ) function will be forced
to write the page to disk (see the next section).
Returns 1 (success).
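The steps above can be sketched as a short C function. This is an illustrative model, not the kernel implementation: the flag bits, the toy slot allocator, and the `toy_` names are assumptions, and the swap-cache insertion is reduced to recording the identifier.

```c
#include <assert.h>

#define PG_uptodate 0x1
#define PG_dirty    0x2

struct toy_page { unsigned long flags; unsigned long entry; };

/* Stands in for get_swap_page(): returns the next free slot index,
 * or 0 when the toy swap area is full. */
static unsigned long next_free_slot = 1, last_slot = 3;
static unsigned long toy_get_swap_page(void)
{
    return next_free_slot > last_slot ? 0 : next_free_slot++;
}

/* Sketch of add_to_swap(): grab a page slot, record the identifier
 * (standing in for the swap-cache insertion), and mark the page dirty
 * so shrink_list() will be forced to write it to disk. */
static int toy_add_to_swap(struct toy_page *page)
{
    unsigned long slot = toy_get_swap_page();
    if (!slot)
        return 0;                       /* no free page slot */
    page->entry = slot;
    page->flags |= PG_uptodate | PG_dirty;
    return 1;
}
```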
Once add_to_swap( )
terminates, shrink_list( )
invokes try_to_unmap( ), which
determines the address of every User Mode page table entry referring
to the anonymous page and writes into it a swapped-out page
identifier; this is described in the section "Reverse Mapping for Anonymous
Pages" earlier in this chapter.
The next action to be performed to complete the
swap-out consists of writing the page's contents into the swap area.
This I/O transfer is activated by the shrink_list( ) function, which checks
whether the PG_dirty flag of the
page frame is set and consequently executes the pageout( ) function (see Figure 17-5 earlier in
this chapter).
As explained in the section "Low On Memory
Reclaiming" earlier in this chapter, the pageout( ) function sets up a writeback_control descriptor and invokes
the writepage method of the
page's address_space object. The
writepage method of the swapper_space object is implemented by the
swap_writepage( )
function.
The swap_writepage( )
function, in turn, performs essentially the following steps:
Checks whether at least one User Mode process is
referencing the page. If not, it removes the page from the swap
cache and returns 0. This check is necessary because a process
might race with the PFRA and release a page after the check
performed by shrink_list(
).
Invokes get_swap_bio( )
to allocate and initialize a bio descriptor (see the section "The Bio Structure"
in Chapter 14). The
function derives the address of the swap area descriptor from
the swapped-out page identifier; then, it walks the swap extent
lists to determine the initial disk sector of the page slot. The
bio descriptor will include a
request for a single page of data (the page slot); the
completion method is set to the end_swap_bio_write( ) function.
Sets the PG_writeback
flag in the page descriptor and the writeback tags in the swap
cache's radix tree (see the section "The Tags of the Radix
Tree" in Chapter
15). Moreover, the function resets the PG_locked flag.
Invokes submit_bio( ),
passing to it the WRITE
command and the bio
descriptor address.
Returns 0.
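The part of step 2 that walks the swap extent lists — turning a page-slot index into a disk location — is worth a concrete sketch. The extent layout below is an illustrative assumption, not the kernel's `swap_extent` structure.

```c
#include <assert.h>

/* A toy swap extent: nr_pages consecutive page slots starting at
 * first_page map onto consecutive disk blocks starting at start_block. */
struct toy_extent { unsigned long first_page, nr_pages, start_block; };

/* Sketch of the mapping get_swap_bio() needs: walk the extent list and
 * translate a page-slot index into the initial disk block of the slot.
 * Returns -1 if no extent covers the slot. */
static long toy_map_swap_entry(const struct toy_extent *ext, int n,
                               unsigned long offset)
{
    for (int i = 0; i < n; i++)
        if (offset >= ext[i].first_page &&
            offset < ext[i].first_page + ext[i].nr_pages)
            return (long)(ext[i].start_block + (offset - ext[i].first_page));
    return -1;
}
```

For a swap partition there is a single extent, so the translation is a constant shift; swap files may need several extents because their blocks are not contiguous on disk.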
Once the I/O data transfer terminates, the end_swap_bio_write( ) function is
executed. Essentially, this function wakes up any process waiting
until the PG_writeback flag of
the page is cleared, clears the PG_writeback flag and the corresponding
tags in the radix tree, and releases the bio descriptor used for the I/O
transfer.
The last step of the swap-out operation is performed once more
by shrink_list( ): if it verifies
that no process has tried to access the page frame while doing the
I/O data transfer, it essentially invokes delete_from_swap_cache( ) to remove the
page frame from the swap cache. Because the swap cache was the only
owner of the page, the page frame is released to the buddy
system.
Swap-in takes place when a process attempts to address a page that has been swapped out to disk. The Page Fault exception handler triggers a swap-in operation when the following conditions occur (see the section "Handling a Faulty Address Inside the Address Space" in Chapter 9):
The page including the address that caused the exception is a valid one—that is, it belongs to a memory region of the current process.
The page is not present in memory—that is, the Present flag in the Page Table entry is
cleared.
The Page Table entry associated with the page is not null,
but the Dirty bit is clear;
this means that the entry contains a swapped-out page identifier
(see the section "Demand Paging" in
Chapter 9).
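The last two conditions amount to a simple predicate on the Page Table entry, sketched below. The bit positions are illustrative assumptions — real x86 PTE layouts differ — but the logic is the one described: a non-null entry with Present and Dirty both clear holds a swapped-out page identifier rather than a physical address.

```c
#include <assert.h>

/* Toy PTE bits; real hardware layouts differ, this is only illustrative. */
#define PTE_PRESENT 0x1
#define PTE_DIRTY   0x2

/* Does this entry contain a swapped-out page identifier? */
static int pte_holds_swap_entry(unsigned long pte)
{
    return pte != 0 && !(pte & PTE_PRESENT) && !(pte & PTE_DIRTY);
}
```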
If all the above conditions are satisfied, handle_pte_fault( ) invokes a quite handy
do_swap_page( ) function to swap in
the page required.
The do_swap_page( )
function acts on the following parameters:
mm
Memory descriptor address of the process that caused the Page Fault exception
vma
Memory region descriptor address of the region that
includes address
address
Linear address that causes the exception
page_table
Address of the Page Table entry that maps address
pmd
Address of the Page Middle Directory that maps address
orig_pte
Content of the Page Table entry that maps address
write_access
Flag denoting whether the attempted access was a read or a write
Contrary to other functions, do_swap_page( ) never returns 0. It
returns 1 if the page is already in the swap cache (minor fault), 2
if the page was read from the swap area (major fault), and -1 if an
error occurred while performing the swap-in. It essentially executes
the following steps:
Gets the swapped-out page identifier from orig_pte.
Invokes pte_unmap( ) to
release any temporary kernel mapping for the Page Table created
by the handle_mm_fault( )
function (see the section "Handling a Faulty Address
Inside the Address Space" in Chapter 9). As explained in
the section "Kernel
Mappings of High-Memory Page Frames" in Chapter 8, a kernel mapping
is required to access a page table in high memory.
Releases the page_table_lock spin lock of the
memory descriptor (it was acquired by the caller function
handle_pte_fault( )).
Invokes lookup_swap_cache(
) to check whether the swap cache already contains a
page corresponding to the swapped-out page identifier; if the
page is already in the swap cache, it jumps to step 6.
Invokes the swapin_readahead(
) function to read from the swap area a group of at
most 2^n pages, including the requested one.
The value n is stored in the page_cluster variable, and is usually
equal to 3.[*] Each page is read by invoking the read_swap_cache_async( ) function (see
below).
Invokes read_swap_cache_async(
) once more to swap in precisely the page accessed by
the process that caused the Page Fault. This step might appear
redundant, but it isn't really. The swapin_readahead( ) function might
fail in reading the requested page—for instance, because
page_cluster is set to 0 or
the function tried to read a group of pages including a free
page slot or a defective page slot (SWAP_MAP_BAD). On the other hand, if
swapin_readahead( )
succeeded, this invocation of read_swap_cache_async( ) terminates
quickly because it finds the page in the swap cache.
If, despite all efforts, the requested page was not added
to the swap cache, another kernel control path might have
already swapped in the requested page on behalf of a clone of
this process. This case is checked by temporarily acquiring the
page_table_lock spin lock and
comparing the entry to which page_table points with orig_pte. If they differ, the page has
already been swapped in by some other kernel control path, so
the function returns 1 (minor fault); otherwise, it returns -1
(failure).
At this point, we know that the page is in the swap cache.
If the page has been effectively swapped in (major fault), the
function invokes grab_swap_token(
) to try to grab the swap token (see the section
"The Swap
Token" earlier in this chapter).
Invokes mark_page_accessed(
) (see the earlier section "The Least Recently Used
(LRU) Lists") and locks the page.
Acquires the page_table_lock spin lock.
Checks whether another kernel control path has swapped in
the requested page on behalf of a clone of this process. In this
case, it releases the page_table_lock spin lock, unlocks the
page, and returns 1 (minor fault).
Invokes swap_free( ) to
decrease the usage counter of the page slot corresponding to
entry.
Checks whether the swap cache is at least 50 percent full
(nr_swap_pages is smaller
than half of total_swap_pages). If so, it checks
whether the page is owned only by the process that caused the
fault (or one of its clones); if this is the case, removes the
page from the swap cache.
Increases the rss field
of the process's memory descriptor.
Updates the Page Table entry so the process can find the
page. The function accomplishes this by writing the physical
address of the requested page and the protection bits found in
the vm_page_prot field of the
memory region into the Page Table entry addressed by page_table. Moreover, if the access
that caused the fault was a write and the faulting process is
the unique owner of the page, the function also sets the
Dirty flag and the Read/Write flag to prevent a useless
Copy On Write fault.
Unlocks the page.
Invokes page_add_anon_rmap(
) to insert the anonymous page in the object-based
reverse mapping data structures (see the section "Reverse Mapping for
Anonymous Pages" earlier in this chapter.)
If the write_access
parameter is equal to 1, the function invokes do_wp_page( ) to make a copy of the
page frame (see the section "Copy On Write" in
Chapter 9).
Releases the mm->page_table_lock spin lock and
returns the ret return code:
1 (minor fault) or 2 (major fault).
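The return-code logic of the steps above can be distilled into a short sketch. This is an illustrative model, not the kernel function: the locking and PTE fix-up are elided, the callbacks stand in for `lookup_swap_cache( )` and the readahead-plus-retry path, and all `toy_` names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes mirroring the text: do_swap_page() never returns 0. */
#define FAULT_MINOR 1   /* page found already in the swap cache */
#define FAULT_MAJOR 2   /* page had to be read from the swap area */
#define FAULT_ERROR (-1)

struct toy_page { int dummy; };

/* Sketch of the top-level decision in do_swap_page(). */
static int toy_do_swap_page(struct toy_page *(*lookup)(unsigned long),
                            struct toy_page *(*read_async)(unsigned long),
                            unsigned long entry)
{
    struct toy_page *page = lookup(entry);   /* lookup_swap_cache() */
    int ret = FAULT_MINOR;

    if (!page) {
        /* swapin_readahead() + read_swap_cache_async() */
        page = read_async(entry);
        if (!page)
            return FAULT_ERROR;              /* nothing ended up in the cache */
        ret = FAULT_MAJOR;
    }
    /* ... mark_page_accessed(), swap_free(), Page Table fix-up ... */
    return ret;
}

/* Illustrative callbacks for exercising the sketch. */
static struct toy_page toy_frame;
static struct toy_page *toy_hit(unsigned long e)  { (void)e; return &toy_frame; }
static struct toy_page *toy_miss(unsigned long e) { (void)e; return NULL; }
```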
The read_swap_cache_async(
) function is invoked whenever the kernel must swap in a
page. It acts on three parameters:
entry
A swapped-out page identifier
vma
A pointer to the memory region that should contain the page
addr
The linear address of the page
As we know, before accessing the swap partition, the function must check whether the swap cache already includes the desired page frame. Therefore, the function essentially executes the following operations:
Invokes radix_tree_lookup(
) to locate in the radix tree of the swapper_space object a page frame at
the position given by the swapped-out page identifier entry. If the page is found, it
increases its reference counter and returns the address of its
descriptor.
The page is not included in the swap cache. Invokes
alloc_pages( ) to allocate a
new page frame. If no free page frame is available, it returns 0
(indicating the system is out of memory).
Invokes add_to_swap_cache(
) to insert the page descriptor of the new page frame
into the swap cache. As mentioned in the earlier section "Swap cache helper
functions," this function also locks the page.
The previous step might fail if add_to_swap_cache( ) finds a duplicate
of the page in the swap cache. For instance, the process could
block in step 2, thus allowing another process to start a
swap-in operation on the same page slot. In this case, it
releases the page frame allocated in step 2 and restarts from
step 1.
Invokes lru_cache_add_active(
) to insert the page in the LRU active list (see the
section "The Least
Recently Used (LRU) Lists" earlier in this
chapter).
The page descriptor of the new page frame is now in the
swap cache. Invokes swap_readpage(
) to read the page's contents from the swap area. This
function is quite similar to swap_writepage( ) described in the
earlier section "Swapping Out
Pages": it clears the PG_uptodate flag of the page
descriptor, invokes get_swap_bio(
) to allocate and initialize a bio descriptor for the I/O transfer,
and invokes submit_bio( ) to
submit the I/O request to the block subsystem layer.
Returns the address of the page descriptor.
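The lookup-allocate-insert-retry shape of these steps can be sketched compactly. The model below is illustrative, not kernel code: a flag array stands in for the `swapper_space` radix tree, the `lose_race` parameter simulates step 4's duplicate detection, and the disk read itself is elided.

```c
#include <assert.h>

#define TABLE 8

/* A toy stand-in for the swapper_space radix tree. */
static int cached[TABLE];         /* 1 = a frame for this entry is cached */
static int frames_allocated;

/* Sketch of read_swap_cache_async()'s control flow: if another path
 * slips the same entry into the cache while we allocate a frame, drop
 * our duplicate and retry the lookup from the top.
 * Returns 1 on a cache hit, 0 when we inserted the page ourselves. */
static int toy_read_swap_cache(unsigned long entry, int lose_race)
{
    for (;;) {
        if (cached[entry % TABLE])
            return 1;                    /* step 1: found, no I/O needed */
        frames_allocated++;              /* step 2: alloc_pages() */
        if (lose_race) {                 /* step 4: a sibling inserted it */
            cached[entry % TABLE] = 1;   /* ...on its own behalf */
            frames_allocated--;          /* release our duplicate frame */
            lose_race = 0;
            continue;                    /* restart from the lookup */
        }
        cached[entry % TABLE] = 1;       /* step 3: add_to_swap_cache() */
        /* step 6: swap_readpage() would start the disk read here */
        return 0;
    }
}
```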
[*] "Permanent" page slots protect against overflows of swap_map counters. Without them, valid
page slots could become "defective" if they are referenced too
many times, thus leading to data losses. However, no one really
expects that a page slot counter could reach the value 32,768.
It's just a "belt and suspenders" approach.
[*] Actually, the page might also belong to an IPC shared memory region; Chapter 19 has a discussion of this case.
[*] As you may have noticed, the names of Linux data structures are not always appropriate. In this case, the kernel does not really "cluster" page slots of a swap area.
[*] In some cases, the swap cache also improves system performance: consider a server daemon that services requests by creating child processes. Under heavy system load, a page can get swapped out from the parent process, and it will never be paged in again for the parent process. Without the swap cache, every child process forked off by the daemon needs to fault that page in from the swap area.
In this chapter, we finish our extensive discussion of I/O and filesystems by taking a look at the details the kernel has to take care of when interacting with a specific filesystem. Because the Second Extended Filesystem (Ext2) is native to Linux and is used on virtually every Linux system, it is a natural choice for this discussion. Furthermore, Ext2 illustrates a lot of good practices in its support for modern filesystem features with fast performance. To be sure, other filesystems supported by Linux include many interesting features, but we have no room to examine all of them.
After introducing Ext2 in the section "General Characteristics of Ext2," we describe the data structures needed, just as in other chapters. Because we are looking at a specific way to store data on disk, we have to consider two versions of the same data structures. The section "Ext2 Disk Data Structures" shows the data structures stored by Ext2 on disk, while "Ext2 Memory Data Structures" shows the corresponding versions in memory.
Then we get to the operations performed on the filesystem. In the section "Creating the Ext2 Filesystem," we discuss how Ext2 is created in a disk partition. The next sections describe the kernel activities performed whenever the disk is used. Most of these are relatively low-level activities dealing with the allocation of disk space to inodes and data blocks.
In the last section, we give a short description of the Ext3 filesystem, which is the next step in the evolution of the Ext2 filesystem.
Unix-like operating systems use several types of
filesystems. Although the files of all such filesystems have a common
subset of attributes required by a few POSIX APIs such as stat( ), each filesystem is implemented in a
different way.
The first versions of Linux were based on the MINIX filesystem. As Linux matured, the Extended Filesystem (Ext FS) was introduced; it included several significant extensions, but offered unsatisfactory performance. The Second Extended Filesystem (Ext2) was introduced in 1994; besides including several new features, it is quite efficient and robust and is, together with its offspring Ext3, the most widely used Linux filesystem.
The following features contribute to the efficiency of Ext2:
When creating an Ext2 filesystem, the system administrator may choose the optimal block size (from 1,024 to 4,096 bytes), depending on the expected average file length. For instance, a 1,024-byte block size is preferable when the average file length is smaller than a few thousand bytes because this leads to less internal fragmentation—that is, less of a mismatch between the file length and the portion of the disk that stores it (see the section "Memory Area Management" in Chapter 8, where internal fragmentation for dynamic memory was discussed). On the other hand, larger block sizes are usually preferable for files greater than a few thousand bytes because this leads to fewer disk transfers, thus reducing system overhead.
When creating an Ext2 filesystem, the system administrator may choose how many inodes to allow for a partition of a given size, depending on the expected number of files to be stored on it. This maximizes the effectively usable disk space.
The filesystem partitions disk blocks into groups. Each group includes data blocks and inodes stored in adjacent tracks. Thanks to this structure, files stored in a single block group can be accessed with a lower average disk seek time.
The filesystem preallocates disk data blocks to regular files before they are actually used. Thus, when the file increases in size, several blocks are already reserved at physically adjacent positions, reducing file fragmentation.
Fast symbolic links (see the section "Hard and Soft Links" in Chapter 1) are supported. If the symbolic link represents a short pathname (at most 60 characters), it can be stored in the inode and can thus be translated without reading a data block.
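To make the first point above—block size versus internal fragmentation—concrete, the trade-off can be expressed with a little arithmetic. The helper below is an illustrative sketch, not part of the Ext2 sources:

```c
#include <stdint.h>

/* Bytes wasted by internal fragmentation when a file of `size` bytes
 * is stored in `block_size`-byte blocks: the unused tail of the last,
 * partially filled block.  Illustrative helper, not kernel code. */
uint64_t wasted_bytes(uint64_t size, uint64_t block_size)
{
    uint64_t blocks = (size + block_size - 1) / block_size;  /* ceiling */
    return blocks * block_size - size;
}
```

For a 1,100-byte file, 1,024-byte blocks waste 948 bytes while 4,096-byte blocks waste 2,996 bytes; the balance reverses for large files, where larger blocks mean fewer disk transfers.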
Moreover, the Second Extended Filesystem includes other features that make it both robust and flexible:
A careful implementation of file-updating that minimizes the impact of system crashes. For instance, when creating a new hard link for a file, the counter of hard links in the disk inode is increased first, and the new name is added into the proper directory next. In this way, if a hardware failure occurs after the inode update but before the directory can be changed, the directory is consistent, even if the inode's hard link counter is wrong. Deleting the file does not lead to catastrophic results, although the file's data blocks cannot be automatically reclaimed. If the reverse were done (changing the directory before updating the inode), the same hardware failure would produce a dangerous inconsistency: deleting the original hard link would remove its data blocks from disk, yet the new directory entry would refer to an inode that no longer exists. If that inode number were used later for another file, writing into the stale directory entry would corrupt the new file.
Support for automatic consistency checks on the filesystem status at boot time. The checks are performed by the e2fsck external program, which may be activated not only after a system crash, but also after a predefined number of filesystem mounts (a counter is increased after each mount operation) or after a predefined amount of time has elapsed since the most recent check.
Support for immutable files (they cannot be modified, deleted, or renamed) and for append-only files (data can be added only to the end of them).
Compatibility with both the Unix System V Release 4 and the BSD semantics of the user group ID for a new file. In SVR4, the new file assumes the user group ID of the process that creates it; in BSD, the new file inherits the user group ID of the directory containing it. Ext2 includes a mount option that specifies which semantic to use.
Even if the Ext2 filesystem is a mature, stable program, several additional features have been considered for inclusion. Some of them have already been coded and are available as external patches. Others are just planned, but in some cases, fields have already been introduced in the Ext2 inode for them. The most significant features being considered are:
System administrators usually choose large block sizes for accessing disks, because computer applications often deal with large files. As a result, small files stored in large blocks waste a lot of disk space. This problem can be solved by allowing several files to be stored in different fragments of the same block.
These new options, which must be specified when creating a file, allow users to transparently store compressed and/or encrypted versions of their files on disk.
An undelete option allows users to easily recover, if needed, the contents of a previously removed file.
Journaling avoids the time-consuming check that is automatically performed on a filesystem when it is abruptly unmounted — for instance, as a consequence of a system crash.
In practice, none of these features has been officially included in the Ext2 filesystem. One might say that Ext2 is a victim of its success; it was the preferred filesystem adopted by most Linux distribution companies until a few years ago, and the millions of users who relied on it every day would have looked suspiciously at any attempt to replace Ext2 with some other filesystem.
The most compelling feature missing from Ext2 is journaling, which is required by high-availability servers. To provide for a smooth transition, journaling has not been introduced in the Ext2 filesystem; rather, as we'll discuss in the later section "The Ext3 Filesystem," a more recent filesystem that is fully compatible with Ext2 has been created, which also offers journaling. Users who do not really require journaling may continue to use the good old Ext2 filesystem, while the others will likely adopt the new filesystem. Nowadays, most distributions adopt Ext3 as the standard filesystem.
The first block in each Ext2 partition is never managed by the Ext2 filesystem, because it is reserved for the partition boot sector (see Appendix A). The rest of the Ext2 partition is split into block groups, each of which has the layout shown in Figure 18-1. As you will notice from the figure, some data structures must fit in exactly one block, while others may require more than one block. All the block groups in the filesystem have the same size and are stored sequentially, so the kernel can derive the location of a block group on disk simply from its integer index.
Block groups reduce file fragmentation, because the kernel tries to keep the data blocks belonging to a file in the same block group, if possible. Each block in a block group contains one of the following pieces of information:
A copy of the filesystem's superblock
A copy of the group of block group descriptors
A data block bitmap
An inode bitmap
A table of inodes
A chunk of data that belongs to a file; i.e., data blocks
If a block does not contain any meaningful information, it is said to be free.
As you can see from Figure 18-1, both the superblock and the group descriptors are duplicated in each block group. Only the superblock and the group descriptors included in block group 0 are used by the kernel, while the remaining superblocks and group descriptors are left unchanged; in fact, the kernel doesn't even look at them. When the e2fsck program executes a consistency check on the filesystem status, it refers to the superblock and the group descriptors stored in block group 0, and then copies them into all other block groups. If data corruption occurs and the main superblock or the main group descriptors in block group 0 become invalid, the system administrator can instruct e2fsck to refer to the old copies of the superblock and the group descriptors stored in block groups other than the first. Usually, the redundant copies store enough information to allow e2fsck to bring the Ext2 partition back to a consistent state.
How many block groups are there? Well, that depends both on the partition size and the block size. The main constraint is that the block bitmap, which is used to identify the blocks that are used and free inside a group, must be stored in a single block. Therefore, in each block group, there can be at most 8×b blocks, where b is the block size in bytes. Thus, the total number of block groups is roughly s/(8×b), where s is the partition size in blocks.
For example, let's consider a 32-GB Ext2 partition with a 4-KB block size. In this case, each 4-KB block bitmap describes 32K data blocks — that is, 128 MB. Therefore, at most 256 block groups are needed. Clearly, the smaller the block size, the larger the number of block groups.
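The computation above can be sketched as follows. This is an illustrative helper (the kernel derives these values from the superblock fields described later), not a copy of the Ext2 sources:

```c
#include <stdint.h>

/* Approximate number of block groups for a partition of `s` blocks
 * with a block size of `b` bytes: a group's block bitmap must fit in
 * a single block, so a group covers at most 8*b blocks.
 * Illustrative helper, not kernel code. */
uint64_t ext2_block_groups(uint64_t s, uint64_t b)
{
    uint64_t blocks_per_group = 8 * b;
    return (s + blocks_per_group - 1) / blocks_per_group;  /* ceiling */
}
```

For the 32-GB partition with 4-KB blocks, s is 8,388,608 blocks and each group covers 32,768 blocks, giving the 256 block groups mentioned above.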
An Ext2 disk superblock is stored in an ext2_super_block structure, whose fields are
listed in Table 18-1.[*] The __u8, __u16, and __u32 data types denote unsigned numbers of
length 8, 16, and 32 bits respectively, while the __s8, __s16, and __s32 data types denote
signed numbers of length 8, 16, and 32 bits. To explicitly specify the order in which the
bytes of a word or double-word are stored on disk, the kernel also makes use of the
__le16 and __le32 data types, which denote the little-endian ordering for words and
double-words (the least significant byte is stored at the lowest address), and of the
__be16 and __be32 data types, which denote the big-endian ordering (the most significant
byte is stored at the lowest address).
Table 18-1. The fields of the Ext2 superblock
Type | Field | Description |
|---|---|---|
| __le32 | s_inodes_count | Total number of inodes |
| __le32 | s_blocks_count | Filesystem size in blocks |
| __le32 | s_r_blocks_count | Number of reserved blocks |
| __le32 | s_free_blocks_count | Free blocks counter |
| __le32 | s_free_inodes_count | Free inodes counter |
| __le32 | s_first_data_block | Number of first useful block (always 1) |
| __le32 | s_log_block_size | Block size |
| __le32 | s_log_frag_size | Fragment size |
| __le32 | s_blocks_per_group | Number of blocks per group |
| __le32 | s_frags_per_group | Number of fragments per group |
| __le32 | s_inodes_per_group | Number of inodes per group |
| __le32 | s_mtime | Time of last mount operation |
| __le32 | s_wtime | Time of last write operation |
| __le16 | s_mnt_count | Mount operations counter |
| __le16 | s_max_mnt_count | Number of mount operations before check |
| __le16 | s_magic | Magic signature |
| __le16 | s_state | Status flag |
| __le16 | s_errors | Behavior when detecting errors |
| __le16 | s_minor_rev_level | Minor revision level |
| __le32 | s_lastcheck | Time of last check |
| __le32 | s_checkinterval | Time between checks |
| __le32 | s_creator_os | OS where filesystem was created |
| __le32 | s_rev_level | Revision level of the filesystem |
| __le16 | s_def_resuid | Default UID for reserved blocks |
| __le16 | s_def_resgid | Default user group ID for reserved blocks |
| __le32 | s_first_ino | Number of first nonreserved inode |
| __le16 | s_inode_size | Size of on-disk inode structure |
| __le16 | s_block_group_nr | Block group number of this superblock |
| __le32 | s_feature_compat | Compatible features bitmap |
| __le32 | s_feature_incompat | Incompatible features bitmap |
| __le32 | s_feature_ro_compat | Read-only compatible features bitmap |
| __u8 [16] | s_uuid | 128-bit filesystem identifier |
| char [16] | s_volume_name | Volume name |
| char [64] | s_last_mounted | Pathname of last mount point |
| __le32 | s_algorithm_usage_bitmap | Used for compression |
| __u8 | s_prealloc_blocks | Number of blocks to preallocate |
| __u8 | s_prealloc_dir_blocks | Number of blocks to preallocate for directories |
| __u16 | s_padding1 | Alignment to word |
| __u32 [204] | s_reserved | Nulls to pad out 1,024 bytes |
The s_inodes_count field
stores the number of inodes, while the s_blocks_count field stores the number of
blocks in the Ext2 filesystem.
The s_log_block_size field
expresses the block size as a power of 2, using 1,024 bytes as the
unit. Thus, 0 denotes 1,024-byte blocks, 1 denotes 2,048-byte blocks,
and so on. The s_log_frag_size
field is currently equal to s_log_block_size, because block
fragmentation is not yet implemented.
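In other words, the block size can be derived from s_log_block_size with a shift. The one-liner below mirrors the rule just stated (an illustrative sketch, not the kernel's own code):

```c
#include <stdint.h>

/* Block size in bytes from the superblock's s_log_block_size field:
 * the field is the base-2 logarithm of the size in 1,024-byte units.
 * Illustrative helper, not kernel code. */
uint32_t ext2_block_size(uint32_t s_log_block_size)
{
    return 1024U << s_log_block_size;
}
```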
The s_blocks_per_group,
s_frags_per_group, and s_inodes_per_group fields store the number
of blocks, fragments, and inodes in each block group,
respectively.
Some disk blocks are reserved to the superuser (or to some other
user or group of users selected by the s_def_resuid and s_def_resgid fields). These blocks allow the
system administrator to continue to use the filesystem even when no
more free blocks are available for normal users.
The s_mnt_count, s_max_mnt_count, s_lastcheck, and s_checkinterval fields set up the Ext2
filesystem to be checked automatically at boot time. These fields
cause e2fsck to run after a
predefined number of mount operations has been performed, or when a
predefined amount of time has elapsed since the last consistency
check. (Both kinds of checks can be used together.) The consistency
check is also enforced at boot time if the filesystem has not been
cleanly unmounted (for instance, after a system crash) or when the
kernel discovers some errors in it. The s_state field stores the value 0 if the
filesystem is mounted or was not cleanly unmounted, 1 if it was
cleanly unmounted, and 2 if it contains errors.
Each block group has its own group descriptor, an
ext2_group_desc structure whose
fields are illustrated in Table 18-2.
Table 18-2. The fields of the Ext2 group descriptor
Type | Field | Description |
|---|---|---|
| __le32 | bg_block_bitmap | Block number of block bitmap |
| __le32 | bg_inode_bitmap | Block number of inode bitmap |
| __le32 | bg_inode_table | Block number of first inode table block |
| __le16 | bg_free_blocks_count | Number of free blocks in the group |
| __le16 | bg_free_inodes_count | Number of free inodes in the group |
| __le16 | bg_used_dirs_count | Number of directories in the group |
| __le16 | bg_pad | Alignment to word |
| __le32 [3] | bg_reserved | Nulls to pad out 24 bytes |
The bg_free_blocks_count,
bg_free_inodes_count, and bg_used_dirs_count fields are used when
allocating new inodes and data blocks. These fields determine the most
suitable block in which to allocate each data structure. The bitmaps
are sequences of bits, where the value 0 specifies that the
corresponding inode or data block is free and the value 1 specifies
that it is used. Because each bitmap must be stored inside a single
block and because the block size can be 1,024, 2,048, or 4,096 bytes,
a single bitmap describes the state of 8,192, 16,384, or 32,768
blocks.
The inode table consists of a series of consecutive
blocks, each of which contains a predefined number of inodes. The
block number of the first block of the inode table is stored in the
bg_inode_table field of the group
descriptor.
All inodes have the same size: 128 bytes. A 1,024-byte block
contains 8 inodes, while a 4,096-byte block contains 32 inodes. To
figure out how many blocks are occupied by the inode table, divide the
total number of inodes in a group (stored in the s_inodes_per_group field of the superblock)
by the number of inodes per block.
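This division can be sketched as follows (an illustrative helper assuming the 128-byte on-disk inode, not kernel code):

```c
#include <stdint.h>

/* Number of blocks occupied by a block group's inode table, given
 * the 128-byte on-disk inode size.  Illustrative helper. */
uint32_t inode_table_blocks(uint32_t inodes_per_group, uint32_t block_size)
{
    uint32_t inodes_per_block = block_size / 128;
    return (inodes_per_group + inodes_per_block - 1) / inodes_per_block;
}
```

With 4,096-byte blocks (32 inodes per block), a group holding 8,192 inodes needs a 256-block inode table.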
Each Ext2 inode is an ext2_inode structure whose fields are
illustrated in Table
18-3.
Table 18-3. The fields of an Ext2 disk inode
Type | Field | Description |
|---|---|---|
| __le16 | i_mode | File type and access rights |
| __le16 | i_uid | Owner identifier |
| __le32 | i_size | File length in bytes |
| __le32 | i_atime | Time of last file access |
| __le32 | i_ctime | Time that inode last changed |
| __le32 | i_mtime | Time that file contents last changed |
| __le32 | i_dtime | Time of file deletion |
| __le16 | i_gid | User group identifier |
| __le16 | i_links_count | Hard links counter |
| __le32 | i_blocks | Number of data blocks of the file |
| __le32 | i_flags | File flags |
| union | osd1 | Specific operating system information |
| __le32 [EXT2_N_BLOCKS] | i_block | Pointers to data blocks |
| __le32 | i_generation | File version (used when the file is accessed by a network filesystem) |
| __le32 | i_file_acl | File access control list |
| __le32 | i_dir_acl | Directory access control list |
| __le32 | i_faddr | Fragment address |
| union | osd2 | Specific operating system information |
Many fields related to POSIX specifications are similar to the corresponding fields of the VFS's inode object and have already been discussed in the section "Inode Objects" in Chapter 12. The remaining ones refer to the Ext2-specific implementation and deal mostly with block allocation.
In particular, the i_size
field stores the effective length of the file in bytes, while the
i_blocks field stores the number of
data blocks (in units of 512 bytes) that have been allocated to the
file.
The values of i_size and i_blocks are not necessarily
related. Because a file is always stored in an integer number of
blocks, a nonempty file receives at least one data block (since
fragmentation is not yet implemented) and i_size may be smaller than
512 × i_blocks. On the other hand, as we'll see in the section "File
Holes" later in this chapter, a file may contain holes. In that
case, i_size may be greater than 512 × i_blocks.
The i_block field is an array
of EXT2_N_BLOCKS (usually 15)
pointers to blocks used to identify the data blocks allocated to the
file (see the section "Data Blocks Addressing"
later in this chapter).
The 32 bits reserved for the i_size field limit the file size to 4 GB.
Actually, the highest-order bit of the i_size field is not used, so the maximum
file size is limited to 2 GB. However, the Ext2 filesystem includes a
"dirty trick" that allows larger files on systems that sport a 64-bit
processor such as AMD's Opteron or IBM's PowerPC G5. Essentially, the i_dir_acl field of the inode, which is not
used for regular files, represents a 32-bit extension of the i_size field. Therefore, the file size is
stored in the inode as a 64-bit integer. The 64-bit version of the
Ext2 filesystem is somewhat compatible with the 32-bit version because
an Ext2 filesystem created on a 64-bit architecture may be mounted on
a 32-bit architecture, and vice versa. On a 32-bit architecture, a
large file cannot be accessed unless it is opened with the
O_LARGEFILE flag set (see the
section "The open( )
System Call" in Chapter
12).
Recall that the VFS model requires each file to have a different inode number. In Ext2, there is no need to store on disk a mapping between an inode number and the corresponding block number, because the latter value can be derived from the block group number and the relative position inside the inode table. For example, suppose that each block group contains 4,096 inodes and that we want to know the address on disk of inode 13,021. In this case, the inode belongs to the fourth block group (block group 3) and its disk address is stored in the 733rd entry of the corresponding inode table. As you can see, the inode number is just a key used by the Ext2 routines to retrieve the proper inode descriptor on disk quickly.
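The arithmetic of the example can be written down directly. The helper below is illustrative, not a copy of the kernel's code; it only assumes that inode numbering starts at 1:

```c
#include <stdint.h>

/* Translate an inode number into a block group index and an index
 * inside that group's inode table.  Inode numbering starts at 1.
 * Illustrative helper, not kernel code. */
void ext2_locate_inode(uint32_t ino, uint32_t inodes_per_group,
                       uint32_t *group, uint32_t *index)
{
    *group = (ino - 1) / inodes_per_group;
    *index = (ino - 1) % inodes_per_group;
}
```

For inode 13,021 with 4,096 inodes per group, this yields group 3 and index 732, i.e., the 733rd entry of that group's inode table.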
The Ext2 inode format is a kind of straitjacket for filesystem designers. The length of an inode must be a power of 2 to avoid internal fragmentation in the blocks that store the inode table. Actually, most of the 128 bytes of an Ext2 inode are currently packed with information, and there is little room left for additional fields. On the other hand, expanding the inode length to 256 bytes would be quite wasteful, besides introducing compatibility problems between Ext2 filesystems that use different inode lengths.
Extended attributes have been introduced to
overcome the above limitation. These attributes are stored on a disk
block allocated outside of any inode. The i_file_acl field of an inode points to the
block containing the extended attributes . Different inodes that have the same set of extended
attributes may share the same block.
Each extended attribute has a name and a value. Both of them are
encoded as variable length arrays of characters, as specified by the
ext2_xattr_entry descriptor. Figure 18-2 shows the
layout in Ext2 of the extended attributes inside a block. Each
attribute is split into two parts: the ext2_xattr_entry descriptor together with
the name of the attribute are placed at the beginning of the block,
while the value of the attribute is placed at the end of the block.
The entries at the beginning of the block are ordered according to the
attribute names, while the positions of the values are fixed, because
they are determined by the allocation order of the attributes.
There are many system calls used to set, retrieve, list, and
remove the extended attributes of a file. The setxattr( ), lsetxattr( ), and
fsetxattr( ) system calls set an extended attribute of a file;
essentially, they differ in how symbolic links are handled, and in how
the file is specified (either passing a pathname or a file
descriptor). Similarly, the getxattr( ), lgetxattr( ), and
fgetxattr( ) system calls return the value of an extended attribute.
The listxattr( ), llistxattr( ), and flistxattr( ) system calls
list all extended attributes of a file. Finally, the
removexattr( ), lremovexattr( ), and fremovexattr( ) system calls
remove an extended attribute from a file.
Access control lists were proposed a long time ago to improve the file protection mechanism in Unix filesystems. Instead of classifying the users of a file under three classes—owner, group, and others—an access control list (ACL) can be associated with each file. Thanks to this kind of list, a user may specify for each of his files the names of specific users (or groups of users) and the privileges to be given to these users.
Linux 2.6 fully supports ACLs by making use of inode extended
attributes. As a matter of fact, extended attributes have been
introduced mainly to support ACLs. Therefore, the chacl( ) , setfacl( )
, and getfacl( )
library functions, which allow you to manipulate the
ACLs of a file, rely essentially upon the setxattr( ) and getxattr( ) system calls introduced in the
previous section.
Unfortunately, the outcome of a working group that defined security extensions within the POSIX 1003.1 family of standards has never been formalized as a new POSIX standard. As a result, ACLs are supported nowadays on different filesystem types on many UNIX-like systems, albeit with a number of subtle differences among the different implementations.
The different types of files recognized by Ext2 (regular files, pipes, etc.) use data blocks in different ways. Some files store no data and therefore need no data blocks at all. This section discusses the storage requirements for each type, which are listed in Table 18-4.
Table 18-4. Ext2 file types
File type | Description |
|---|---|
| 0 | Unknown |
| 1 | Regular file |
| 2 | Directory |
| 3 | Character device |
| 4 | Block device |
| 5 | Named pipe |
| 6 | Socket |
| 7 | Symbolic link |
Regular files are the most common case and receive almost all
the attention in this chapter. But a regular file needs data blocks
only when it starts to have data. When first created, a regular file
is empty and needs no data blocks; it can also be emptied by the
truncate( ) or open( ) system calls. Both situations are common; for
instance, when you issue a shell command that includes the string
>filename, the shell creates
an empty file or truncates an existing one.
Ext2 implements directories as a special kind of file whose
data blocks store filenames together with the corresponding inode
numbers. In particular, such data blocks contain structures of type
ext2_dir_entry_2. The fields of
that structure are shown in Table 18-5. The
structure has a variable length, because the last name field is a variable length array of
up to EXT2_NAME_LEN characters
(usually 255). Moreover, for reasons of efficiency, the length of a
directory entry is always a multiple of 4 and, therefore, null
characters (\0) are added for
padding at the end of the filename, if necessary. The name_len field stores the actual filename
length (see Figure
18-3).
Table 18-5. The fields of an Ext2 directory entry

| Type | Field | Description |
|---|---|---|
| __le32 | inode | Inode number |
| __le16 | rec_len | Directory entry length |
| __u8 | name_len | Filename length |
| __u8 | file_type | File type |
| char [EXT2_NAME_LEN] | name | Filename |
The file_type field stores
a value that specifies the file type (see Table 18-4). The
rec_len field may be interpreted
as a pointer to the next valid directory entry: it is the offset to
be added to the starting address of the directory entry to get the
starting address of the next valid directory entry. To delete a
directory entry, it is sufficient to set its inode field to 0 and suitably increment
the value of the rec_len field of
the previous valid entry. Read the rec_len field of Figure 18-3 carefully;
you'll see that the oldfile
entry was deleted because the rec_len field of usr is set to 12+16 (the lengths of the
usr and oldfile entries).
As stated before, if the pathname of a symbolic link
has up to 60 characters, it is stored in the i_block field of the inode, which consists
of an array of 15 4-byte integers; no data block is therefore
required. If the pathname is longer than 60 characters, however, a
single data block is required.
[*] To ensure compatibility between the Ext2 and Ext3
filesystems, the ext2_super_block data structure includes
some Ext3-specific fields, which are not shown in Table 18-1.
For the sake of efficiency, most of the information stored in the disk data structures of an Ext2 partition is copied into RAM when the filesystem is mounted, thus allowing the kernel to avoid many subsequent disk read operations. To get an idea of how often some data structures change, consider some fundamental operations:
When a new file is created, the values of the s_free_inodes_count field in the Ext2
superblock and of the bg_free_inodes_count field in the proper
group descriptor must be decreased.
If the kernel appends some data to an existing file so that
the number of data blocks allocated for it increases, the values of
the s_free_blocks_count field in
the Ext2 superblock and of the bg_free_blocks_count field in the group
descriptor must be modified.
Even just rewriting a portion of an existing file involves an
update of the s_wtime field of
the Ext2 superblock.
Because all Ext2 disk data structures are stored in blocks of the Ext2 partition, the kernel uses the page cache to keep them up-to-date (see the section "Writing Dirty Pages to Disk" in Chapter 15).
Table 18-6 specifies, for each type of data related to Ext2 filesystems and files, the data structure used on the disk to represent its data, the data structure used by the kernel in memory, and a rule of thumb used to determine how much caching is used. Data that is updated very frequently is always cached; that is, the data is permanently stored in memory and included in the page cache until the corresponding Ext2 partition is unmounted. The kernel gets this result by keeping the page's usage counter greater than 0 at all times.
Table 18-6. VFS images of Ext2 data structures

| Type | Disk data structure | Memory data structure | Caching mode |
|---|---|---|---|
| Superblock | ext2_super_block | ext2_sb_info | Always cached |
| Group descriptor | ext2_group_desc | ext2_group_desc | Always cached |
| Block bitmap | Bit array in block | Bit array in buffer | Dynamic |
| inode bitmap | Bit array in block | Bit array in buffer | Dynamic |
| inode | ext2_inode | ext2_inode_info | Dynamic |
| Data block | Array of bytes | VFS buffer | Dynamic |
| Free inode | ext2_inode | None | Never |
| Free block | Array of bytes | None | Never |
The never-cached data is not kept in any cache because it does not represent meaningful information. Conversely, the always-cached data is always present in RAM, thus it is never necessary to read the data from disk (periodically, however, the data must be written back to disk). In between these extremes lies the dynamic mode. In this mode, the data is kept in a cache as long as the associated object (inode, data block, or bitmap) is in use; when the file is closed or the data block is deleted, the page frame reclaiming algorithm may remove the associated data from the cache.
It is interesting to observe that inode and block bitmaps are not kept permanently in memory; rather, they are read from disk when needed. Actually, many disk reads are avoided thanks to the page cache, which keeps in memory the most recently used disk blocks (see the section "Storing Blocks in the Page Cache" in Chapter 15).[*]
As stated in the section "Superblock Objects" in
Chapter 12, the s_fs_info field of the VFS superblock points
to a structure containing filesystem-specific data. In the case of
Ext2, this field points to a structure of type ext2_sb_info, which includes the following
information:
Most of the disk superblock fields
An s_sbh pointer to the
buffer head of the buffer containing the disk superblock
An s_es pointer to the
buffer containing the disk superblock
The number of group descriptors, s_desc_per_block, that can be packed in a block
An s_group_desc pointer
to an array of buffer heads of buffers containing the group
descriptors (usually, a single entry is sufficient)
Other data related to mount state, mount options, and so on
Figure 18-4
shows the links between the ext2_sb_info data structures and the buffers
and buffer heads relative to the Ext2 superblock and to the group
descriptors.
When the kernel mounts an Ext2 filesystem, it invokes the
ext2_fill_super( ) function to
allocate space for the data structures and to fill them with data read
from disk (see the section "Mounting a Generic
Filesystem" in Chapter
12). This is a simplified description of the function, which
emphasizes the memory allocations for buffers and descriptors:
Allocates an ext2_sb_info
descriptor and stores its address in the s_fs_info field of the superblock object
passed as the parameter.
Invokes _ _bread( ) to
allocate a buffer in a buffer page together with the corresponding
buffer head, and to read the superblock from disk into the buffer;
as discussed in the section "Searching Blocks in the
Page Cache" in Chapter
15, no allocation is performed if the block is already
stored in a buffer page in the page cache and it is up-to-date.
Stores the buffer head address in the s_sbh field of the Ext2 superblock
object.
Allocates an array of bytes—one byte for each group—and
stores its address in the s_debts field of the ext2_sb_info descriptor (see the section
"Creating
inodes" later in this chapter).
Allocates an array of pointers to buffer heads, one for each
group descriptor, and stores the address of the array in the
s_group_desc field of the
ext2_sb_info descriptor.
Invokes _ _bread( ) repeatedly to allocate buffers and to read from disk the blocks containing the Ext2 group descriptors; stores the addresses of the buffer heads in the s_group_desc array allocated in the previous step.
Allocates an inode and a dentry object for the root directory, and sets up a few fields of the superblock object so that it will be possible to read the root inode from disk.
Clearly, all the data structures allocated by ext2_fill_super( ) are kept in memory after the function returns; they are released only when the Ext2 filesystem is unmounted. When the kernel must modify a field in the Ext2 superblock, it simply writes the new value in the proper position of the corresponding buffer and then marks the buffer as dirty.
When opening a file, a pathname lookup is performed. For each
component of the pathname that is not already in the dentry
cache , a new dentry object and a new inode object are
created (see the section "Standard Pathname
Lookup" in Chapter
12). When the VFS accesses an Ext2 disk inode, it creates a
corresponding inode descriptor of type ext2_inode_info. This descriptor includes
the following information:
The whole VFS inode object (see Table 12-3 in Chapter 12), stored in the vfs_inode field
Most of the fields found in the disk's inode structure that are not kept in the VFS inode
The i_block_group field, the index of the block group to which the inode belongs (see the section "Ext2 Disk Data Structures" earlier in this chapter)
The i_next_alloc_block
and i_next_alloc_goal fields,
which store the logical block number and the physical block number
of the disk block that was most recently allocated to the file,
respectively
The i_prealloc_block and
i_prealloc_count fields, which
are used for data block preallocation (see the section "Allocating a Data
Block" later in this chapter)
The xattr_sem field, a
read/write semaphore that allows extended attributes to be read
concurrently with the file data
The i_acl and i_default_acl fields, which point to the
ACLs of the file
When dealing with Ext2 files, the alloc_inode superblock method is implemented by means of the ext2_alloc_inode( ) function. It first gets an ext2_inode_info descriptor from the ext2_inode_cachep slab allocator cache, then returns the address of the inode object embedded in the new ext2_inode_info descriptor.
There are generally two stages to creating a filesystem on a disk. The first step is to format it so that the disk driver can read and write blocks on it. Modern hard disks come preformatted from the factory and need not be reformatted; floppy disks may be formatted on Linux using a utility program such as superformat or fdformat. The second step involves creating a filesystem, which means setting up the structures described in detail earlier in this chapter.
Ext2 filesystems are created by the mke2fs utility program; it assumes the following default options, which may be modified by the user with flags on the command line:
Block size: 1,024 bytes (default value for a small filesystem)
Fragment size: block size (block fragmentation is not implemented)
Number of allocated inodes: 1 inode for each 8,192 bytes
Percentage of reserved blocks: 5 percent
The program performs the following actions:
Initializes the superblock and the group descriptors.
Optionally, checks whether the partition contains defective blocks; if so, it creates a list of defective blocks.
For each block group, reserves all the disk blocks needed to store the superblock, the group descriptors, the inode table, and the two bitmaps.
Initializes the inode bitmap and the data block bitmap of each block group to 0.
Initializes the inode table of each block group.
Creates the root directory (/).
Creates the lost+found directory, which is used by e2fsck to link the lost and found defective blocks.
Updates the inode bitmap and the data block bitmap of the block group in which the two previous directories have been created.
Groups the defective blocks (if any) in the lost+found directory.
Let's consider how an Ext2 1.44 MB floppy disk is initialized by mke2fs with the default options.
Once mounted, it appears to the VFS as a volume consisting of 1,412 blocks; each one is 1,024 bytes in length. To examine the disk's contents, we can execute the Unix command:
$ dd if=/dev/fd0 bs=1k count=1440 | od -tx1 -Ax > /tmp/dump_hex
to get a file containing the hexadecimal dump of the floppy disk contents in the /tmp directory.[*]
By looking at that file, we can see that, due to the limited capacity of the disk, a single group descriptor is sufficient. We also notice that the number of reserved blocks is set to 72 (5 percent of 1,440) and, according to the default option, the inode table must include 1 inode for each 8,192 bytes — that is, 184 inodes stored in 23 blocks.
Table 18-7 summarizes how the Ext2 filesystem is created on a floppy disk when the default options are selected.
Table 18-7. Ext2 block allocation for a floppy disk

| Block | Content |
|---|---|
| 0 | Boot block |
| 1 | Superblock |
| 2 | Block containing a single block group descriptor |
| 3 | Data block bitmap |
| 4 | inode bitmap |
| 5-27 | inode table: inodes up to 10: reserved (inode 2 is the root); inode 11: lost+found; inodes 12-184: free |
| 28 | Root directory (includes ., .., and lost+found) |
| 29 | lost+found directory (includes . and ..) |
| 30-40 | Reserved blocks preallocated for lost+found directory |
| 41-1439 | Free blocks |
Many of the VFS methods described in Chapter 12 have a corresponding Ext2 implementation. Because it would take a whole book to describe all of them, we limit ourselves to briefly reviewing the methods implemented in Ext2. Once the disk and the memory data structures are clearly understood, the reader should be able to follow the code of the Ext2 functions that implement them.
Many VFS superblock operations have a specific implementation in Ext2, namely alloc_inode, destroy_inode, read_inode, write_inode, delete_inode, put_super, write_super, statfs, remount_fs, and clear_inode. The addresses of the superblock
methods are stored in the ext2_sops
array of pointers.
Some of the VFS inode operations have a specific implementation in Ext2, which depends on the type of the file to which the inode refers.
The inode operations for Ext2 regular files and Ext2 directories
are shown in Table
18-8; the purpose of each method is described in the section
"Inode Objects" in
Chapter 12. The table does
not show the methods that are undefined (a NULL pointer) for both regular files and
directories; recall that if a method is undefined, the VFS either
invokes a generic function or does nothing at all. The addresses of
the Ext2 methods for regular files and directories are stored in the
ext2_file_inode_operations and
ext2_dir_inode_operations tables,
respectively.
Table 18-8. Ext2 inode operations for regular files and directories

| VFS inode operation | Regular file | Directory |
|---|---|---|
| create | NULL | ext2_create( ) |
| lookup | NULL | ext2_lookup( ) |
| link | NULL | ext2_link( ) |
| unlink | NULL | ext2_unlink( ) |
| symlink | NULL | ext2_symlink( ) |
| mkdir | NULL | ext2_mkdir( ) |
| rmdir | NULL | ext2_rmdir( ) |
| mknod | NULL | ext2_mknod( ) |
| rename | NULL | ext2_rename( ) |
| truncate | ext2_truncate( ) | NULL |
| permission | ext2_permission( ) | ext2_permission( ) |
| setattr | ext2_setattr( ) | ext2_setattr( ) |
| setxattr | generic_setxattr( ) | generic_setxattr( ) |
| getxattr | generic_getxattr( ) | generic_getxattr( ) |
| listxattr | ext2_listxattr( ) | ext2_listxattr( ) |
| removexattr | generic_removexattr( ) | generic_removexattr( ) |
The inode operations for Ext2 symbolic links are shown in Table 18-9 (undefined
methods have been omitted). Actually, there are two types of symbolic
links: the fast symbolic links represent pathnames that can be fully
stored inside the inodes, while the regular symbolic links represent
longer pathnames. Accordingly, there are two sets of inode operations,
which are stored in the ext2_fast_symlink_inode_operations and
ext2_symlink_inode_operations
tables, respectively.
Table 18-9. Ext2 inode operations for fast and regular symbolic links

| VFS inode operation | Fast symbolic link | Regular symbolic link |
|---|---|---|
| readlink | generic_readlink( ) | generic_readlink( ) |
| follow_link | ext2_follow_link( ) | page_follow_link_light( ) |
| put_link | NULL | page_put_link( ) |
| setxattr | generic_setxattr( ) | generic_setxattr( ) |
| getxattr | generic_getxattr( ) | generic_getxattr( ) |
| listxattr | ext2_listxattr( ) | ext2_listxattr( ) |
| removexattr | generic_removexattr( ) | generic_removexattr( ) |
If the inode refers to a character device file, to a block
device file, or to a named pipe (see "FIFOs" in Chapter 19), the inode operations
do not depend on the filesystem. They are specified in the chrdev_inode_operations, blkdev_inode_operations, and fifo_inode_operations tables,
respectively.
The file operations specific to the Ext2 filesystem are listed in Table 18-10. As you can
see, several VFS methods are implemented by generic functions that are
common to many filesystems. The addresses of these methods are stored
in the ext2_file_operations
table.
Table 18-10. Ext2 file operations

| VFS file operation | Ext2 method |
|---|---|
| llseek | generic_file_llseek( ) |
| read | generic_file_read( ) |
| write | generic_file_write( ) |
| aio_read | generic_file_aio_read( ) |
| aio_write | generic_file_aio_write( ) |
| ioctl | ext2_ioctl( ) |
| mmap | generic_file_mmap( ) |
| open | generic_file_open( ) |
| release | ext2_release_file( ) |
| fsync | ext2_sync_file( ) |
| readv | generic_file_readv( ) |
| writev | generic_file_writev( ) |
| sendfile | generic_file_sendfile( ) |
Notice that Ext2's read and write methods are implemented by the generic_file_read( ) and generic_file_write( ) functions, respectively. These are described in the sections "Reading from a File" and "Writing to a File" in Chapter 16.
The storage of a file on disk differs from the view the
programmer has of the file in two ways: blocks can be scattered around
the disk (although the filesystem tries hard to keep blocks sequential
to improve access time), and files may appear to a programmer to be
bigger than they really are because a program can introduce holes into
them (through the lseek( ) system
call).
In this section, we explain how the Ext2 filesystem manages the disk space — how it allocates and deallocates inodes and data blocks. Two main problems must be addressed:
Space management must make every effort to avoid file fragmentation — the physical storage of a file in several, small pieces located in non-adjacent disk blocks. File fragmentation increases the average time of sequential read operations on the files, because the disk heads must be frequently repositioned during the read operation.[*] This problem is similar to the external fragmentation of RAM discussed in the section "The Buddy System Algorithm" in Chapter 8.
Space management must be time-efficient; that is, the kernel should be able to quickly derive from a file offset the corresponding logical block number in the Ext2 partition. In doing so, the kernel should limit as much as possible the number of accesses to addressing tables stored on disk, because each such intermediate access considerably increases the average file access time.
The ext2_new_inode( )
function creates an Ext2 disk inode, returning the address of the
corresponding inode object (or NULL, in case of failure). The function
carefully selects the block group that contains the new inode; this is
done to spread unrelated directories among different groups and, at
the same time, to put files into the same group as their parent
directories. To balance the number of regular files and directories in
a block group, Ext2 introduces a "debt" parameter for every block
group.
The function acts on two parameters: the address dir of the inode object that refers to the
directory into which the new inode must be inserted and a mode that indicates the type of inode being
created. The latter argument also includes the MS_SYNCHRONOUS mount flag (see the section
"Mounting a Generic
Filesystem" in Chapter
12) that requires the current process to be suspended until the
inode is allocated. The function performs the following
actions:
Invokes new_inode( ) to
allocate a new VFS inode object; initializes its i_sb field to the superblock address
stored in dir->i_sb, and
adds it to the in-use inode list and to the superblock's list (see
the section "Inode
Objects" in Chapter
12).
If the new inode is a directory, the function invokes
find_group_orlov( ) to find a
suitable block group for the directory.[*] This function implements the following
heuristics:
Directories having as parent the filesystem root should be spread among all block groups. Thus, the function searches the block groups looking for a group having a number of free inodes and a number of free blocks above the average. If there is no such group, it jumps to step 2c.
Nested directories—not having the filesystem root as parent—should be put in the group of the parent if it satisfies the following rules:
The group does not contain too many directories
The group has a sufficient number of free inodes left
The group has a small "debt" (the debt of a block
group is stored in the array of counters pointed to by the
s_debts field of the
ext2_sb_info
descriptor; the debt is increased each time a new
directory is added and decreased each time another type of
file is added)
If the parent's group does not satisfy these rules, it picks the first group that satisfies them. If no such group exists, it jumps to step 2c.
This is the "fallback" rule, to be used if no good group has been found. The function starts with the block group containing the parent directory and selects the first block group that has more free inodes than the average number of free inodes per block group.
If the new inode is not a directory, it invokes find_group_other( ) to allocate it in a
block group having a free inode. This function selects the group
by starting from the one that contains the parent directory and
moving farther away from it; to be precise:
Performs a quick logarithmic search starting from the
block group that includes the parent directory dir. The algorithm searches
log(n) block groups, where
n is the total number of block groups.
The algorithm jumps further ahead until it finds an available
block group — for example, if we call the number of the
starting block group i, the algorithm
considers block groups i
mod(n),
i+1
mod(n),
i+1+2
mod(n),
i+1+2+4
mod(n), etc.
If the logarithmic search failed in finding a block
group with a free inode, the function performs an exhaustive
linear search starting from the block group that includes the
parent directory dir.
Invokes read_inode_bitmap(
) to get the inode bitmap of the selected block group
and searches for the first null bit into it, thus obtaining the
number of the first free disk inode.
Allocates the disk inode: sets the corresponding bit in the
inode bitmap and marks the buffer containing the bitmap as dirty.
Moreover, if the filesystem has been mounted specifying the
MS_SYNCHRONOUS flag (see the
section "Mounting a
Generic Filesystem" in Chapter 12), the function
invokes sync_dirty_buffer( ) to
start the I/O write operation and waits until the operation
terminates.
Decreases the bg_free_inodes_count field of the group
descriptor. If the new inode is a directory, the function
increases the bg_used_dirs_count field and marks the
buffer containing the group descriptor as dirty.
Increases or decreases the group's counter in the s_debts array of the superblock,
according to whether the inode refers to a regular file or a
directory.
Decreases the s_freeinodes_counter field of the
ext2_sb_info data structure;
moreover, if the new inode is a directory, it increases the
s_dirs_counter field in the
ext2_sb_info data
structure.
Sets the s_dirt flag of the superblock to 1, and marks the buffer that contains it as dirty.
Sets the s_dirt field of
the VFS's superblock object to 1.
Initializes the fields of the inode object. In particular,
it sets the inode number i_no
and copies the value of xtime.tv_sec into i_atime, i_mtime, and i_ctime. Also loads the i_block_group field in the ext2_inode_info structure with the block
group index. Refer to Table 18-3 for the
meaning of these fields.
Initializes the ACLs of the inode.
Inserts the new inode object into the hash table inode_hashtable and invokes mark_inode_dirty( ) to move the inode
object into the superblock's dirty inode list (see the section
"Inode
Objects" in Chapter
12).
Invokes ext2_preread_inode(
) to read from disk the block containing the inode and
to put the block in the page cache. This type of read-ahead is
done because it is likely that a recently created inode will be
written back soon.
Returns the address of the new inode object.
The ext2_free_inode(
) function deletes a disk inode, which is identified by an
inode object whose address inode is
passed as the parameter. The kernel should invoke the function after a
series of cleanup operations involving internal data structures and
the data in the file itself. It should come after the inode object has
been removed from the inode hash table, after the last hard link
referring to that inode has been deleted from the proper directory and
after the file is truncated to 0 length to reclaim all its data blocks
(see the section "Releasing a Data Block"
later in this chapter). It performs the following actions:
Invokes clear_inode( ),
which in turn executes the following operations:
Removes any dirty "indirect" buffer associated with the
inode (see the later section "Data Blocks
Addressing"); they are collected in the list headed at
the private_list field of
the address_space object
inode->i_data (see the
section "The
address_space Object" in Chapter 15).
If the I_LOCK flag of
the inode is set, some of the inode's buffers are involved in
I/O data transfers; the function suspends the current process
until these I/O data transfers terminate.
Invokes the clear_inode method of the superblock
object, if defined; the Ext2 filesystem does not define
it.
If the inode refers to a device file, it removes the
inode object from the device's list of inodes; this list is
rooted either in the list
field of the cdev character
device descriptor (see the section "Character Device
Drivers" in Chapter
13) or in the bd_inodes field of the block_device block device descriptor
(see the section "Block Devices"
in Chapter
14).
Sets the state of the inode to I_CLEAR (the inode object contents
are no longer meaningful).
Computes the index of the block group containing the disk inode from the inode number and the number of inodes in each block group.
Invokes read_inode_bitmap(
) to get the inode bitmap.
Increases the bg_free_inodes_count field of the group descriptor. If the deleted inode is a directory, it decreases the bg_used_dirs_count field. Marks the buffer that contains the group descriptor as dirty.
If the deleted inode is a directory, it decreases the
s_dirs_counter field in the
ext2_sb_info data structure,
sets the s_dirt flag of the
superblock to 1, and marks the buffer that contains it as
dirty.
Clears the bit corresponding to the disk inode in the inode bitmap and marks the buffer that contains the bitmap as dirty. Moreover, if the filesystem has been mounted with the MS_SYNCHRONOUS flag, it invokes sync_dirty_buffer( ) to wait until the write operation on the bitmap's buffer terminates.
Each nonempty regular file consists of a group of data blocks. Such blocks may be referred to either by their relative position inside the file—their file block number—or by their position inside the disk partition—their logical block number (see the section "Block Devices Handling" in Chapter 14).
Deriving the logical block number of the corresponding data block from an offset f inside a file is a two-step process:
Derive from the offset f the file block number — the index of the block that contains the character at offset f.
Translate the file block number to the corresponding logical block number.
Because Unix files do not include any control characters, it is quite easy to derive the file block number containing the f th character of a file: simply take the quotient of f and the filesystem's block size and round down to the nearest integer.
For instance, let's assume a block size of 4 KB. If f is smaller than 4,096, the character is contained in the first data block of the file, which has file block number 0. If f is equal to or greater than 4,096 and less than 8,192, the character is contained in the data block that has file block number 1, and so on.
This is fine as far as file block numbers are concerned. However, translating a file block number into the corresponding logical block number is not nearly as straightforward, because the data blocks of an Ext2 file are not necessarily adjacent on disk.
The Ext2 filesystem must therefore provide a method to store the connection between each file block number and the corresponding logical block number on disk. This mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside the inode. It also involves some specialized blocks that contain extra pointers, which are an inode extension used to handle large files.
The i_block field in the disk
inode is an array of EXT2_N_BLOCKS
components that contain logical block numbers. In the following
discussion, we assume that EXT2_N_BLOCKS has the default value, namely
15. The array represents the initial part of a larger data structure,
which is illustrated in Figure 18-5. As can be seen
in the figure, the 15 components of the array are of 4 different
types:
The first 12 components yield the logical block numbers corresponding to the first 12 blocks of the file—to the blocks that have file block numbers from 0 to 11.
The component at index 12 contains the logical block number of a block, called indirect block, that represents a second-order array of logical block numbers. They correspond to the file block numbers ranging from 12 to b/4+11, where b is the filesystem's block size (each logical block number is stored in 4 bytes, so we divide by 4 in the formula). Therefore, the kernel must look in this component for a pointer to a block, and then look in that block for another pointer to the ultimate block that contains the file contents.
The component at index 13 contains the logical block number of an indirect block containing a second-order array of logical block numbers; in turn, the entries of this second-order array point to third-order arrays, which store the logical block numbers that correspond to the file block numbers ranging from b/4+12 to (b/4)²+(b/4)+11.
Finally, the component at index 14 uses triple indirection: the fourth-order arrays store the logical block numbers corresponding to the file block numbers ranging from (b/4)²+(b/4)+12 to (b/4)³+(b/4)²+(b/4)+11.
In Figure 18-5, the number inside a block represents the corresponding file block number. The arrows, which represent logical block numbers stored in array components, show how the kernel finds its way through indirect blocks to reach the block that contains the actual contents of the file.
Notice how this mechanism favors small files. If the file does not require more than 12 data blocks, any of its data can be retrieved in two disk accesses: one to read a component in the i_block array of the disk inode and the other to read the requested data block. For larger files, however, three or even four consecutive disk accesses may be needed to access the required block. In practice, this is a worst-case estimate, because dentry, inode, and page caches contribute significantly to reducing the number of real disk accesses.
Notice also how the block size of the filesystem affects the addressing mechanism, because a larger block size allows Ext2 to store more logical block numbers inside a single block. Table 18-11 shows the upper limit placed on a file's size for each block size and each addressing mode. For instance, if the block size is 1,024 bytes and the file contains up to 268 KB of data, the first 12 KB of the file can be accessed through direct mapping and the remaining 13-268 KB can be addressed through simple indirection. Files larger than 2 GB must be opened on 32-bit architectures by specifying the O_LARGEFILE opening flag.
Table 18-11. File-size upper limits for data block addressing
| Block size | Direct | 1-Indirect | 2-Indirect | 3-Indirect |
|---|---|---|---|---|
| 1,024 | 12 KB | 268 KB | 64.26 MB | 16.06 GB |
| 2,048 | 24 KB | 1.02 MB | 513.02 MB | 256.5 GB |
| 4,096 | 48 KB | 4.04 MB | 4 GB | ~4 TB |
A file hole is a portion of a regular file that contains null characters and is not stored in any data block on disk. Holes are a long-standing feature of Unix files. For instance, the following Unix command creates a file in which the first bytes are a hole:
$ echo -n "X" | dd of=/tmp/hole bs=1024 seek=6
Now /tmp/hole has 6,145 characters (6,144 null characters plus an X character), yet the file occupies just one data block on disk.
File holes were introduced to avoid wasting disk space. They are used extensively by database applications and, more generally, by all applications that perform hashing on files.
The Ext2 implementation of file holes is based on dynamic data block allocation: a block is
actually assigned to a file only when the process needs to write data
into it. The i_size field of each
inode defines the size of the file as seen by the program, including
the holes, while the i_blocks field
stores the number of data blocks effectively assigned to the file (in
units of 512 bytes).
In our earlier example of the dd command, suppose the /tmp/hole file was created on an Ext2
partition that has blocks of size 4,096. The i_size field of the corresponding disk inode
stores the number 6,145, while the i_blocks field stores the number 8 (because
each 4,096-byte block includes eight 512-byte blocks). The second
element of the i_block array
(corresponding to the block having file block number 1) stores the
logical block number of the allocated block, while all other elements
in the array are null (see Figure 18-6).
When the kernel has to locate a block holding data for
an Ext2 regular file, it invokes the ext2_get_block( ) function. If the block
does not exist, the function automatically allocates the block to the
file. Remember that this function may be invoked every time the kernel
issues a read or write operation on an Ext2 regular file (see the
sections "Reading from a
File" and "Writing to a File" in Chapter 16); clearly, this
function is invoked only if the affected block is not included in the
page cache.
The ext2_get_block( )
function handles the data structures already described in the section
"Data Blocks
Addressing," and when necessary, invokes the ext2_alloc_block( ) function to actually
search for a free block in the Ext2 partition. If necessary, the
function also allocates the blocks used for indirect addressing (see
Figure 18-5).
To reduce file fragmentation, the Ext2 filesystem tries to get a new block for a file near the last block already allocated for the file. Failing that, the filesystem searches for a new block in the block group that includes the file's inode. As a last resort, the free block is taken from one of the other block groups.
The Ext2 filesystem uses preallocation of data blocks. The file
does not get only the requested block, but rather a group of up to
eight adjacent blocks. The i_prealloc_count field in the ext2_inode_info structure stores the number
of data blocks preallocated to a file that are still unused, and the
i_prealloc_block field stores the
logical block number of the next preallocated block to be used. All
preallocated blocks that remain unused are freed when the file is
closed, when it is truncated, or when a write operation is not
sequential with respect to the write operation that triggered the
block preallocation.
The ext2_alloc_block( ) function receives as its parameters a pointer to an inode object, a goal, and the address of a variable that will store an error code. The goal is a logical block number that represents the preferred position of the new block. The ext2_get_block( ) function sets the goal parameter according to the following heuristic:
If the block that is being allocated and the previously allocated block have consecutive file block numbers, the goal is the logical block number of the previous block plus 1; it makes sense that consecutive blocks as seen by a program should be adjacent on disk.
If the first rule does not apply and at least one block has been previously allocated to the file, the goal is one of these blocks' logical block numbers. More precisely, it is the logical block number of the already allocated block that precedes the block to be allocated in the file.
If the preceding rules do not apply, the goal is the logical block number of the first block (not necessarily free) in the block group that contains the file's inode.
The ext2_alloc_block( )
function checks whether the goal refers to one of the preallocated
blocks of the file. If so, it allocates the corresponding block and
returns its logical block number; otherwise, the function discards all
remaining preallocated blocks and invokes ext2_new_block( ).
This latter function searches for a free block inside the Ext2 partition with the following strategy:
If the preferred block passed to ext2_alloc_block( )—the block that is
the goal—is free, the function allocates the block.
If the goal is busy, the function checks whether one of the next blocks after the preferred block is free.
If no free block is found in the near vicinity of the preferred block, the function considers all block groups, starting from the one including the goal. For each block group, the function does the following:
Looks for a group of at least eight adjacent free blocks.
If no such group is found, looks for a single free block.
The search ends as soon as a free block is found. Before
terminating, the ext2_new_block( )
function also tries to preallocate up to eight free blocks adjacent to
the free block found and sets the i_prealloc_block and i_prealloc_count fields of the disk inode to
the proper block location and number of blocks.
When a process deletes a file or truncates it to 0
length, all its data blocks must be reclaimed. This is done by
ext2_truncate( ), which receives
the address of the file's inode object as its parameter. The function
essentially scans the disk inode's i_block array to locate all data blocks and
all blocks used for the indirect addressing. These blocks are then
released by repeatedly invoking ext2_free_blocks( ).
The ext2_free_blocks( )
function releases a group of one or more adjacent data blocks. Besides
its use by ext2_truncate( ), the
function is invoked mainly when discarding the preallocated blocks of
a file (see the earlier section "Allocating a Data
Block"). Its parameters are:
inode
The address of the inode object that describes the file
block
The logical block number of the first block to be released
count
The number of adjacent blocks to be released
The function performs the following actions for each block to be released:
Gets the block bitmap of the block group that includes the block to be released.
Clears the bit in the block bitmap that corresponds to the block to be released and marks the buffer that contains the bitmap as dirty.
Increases the bg_free_blocks_count field in the block
group descriptor and marks the corresponding buffer as
dirty.
Increases the s_free_blocks_count field of the disk
superblock, marks the corresponding buffer as dirty, and sets the
s_dirt flag of the superblock
object.
If the filesystem has been mounted with the MS_SYNCHRONOUS flag set, it invokes
sync_dirty_buffer( ) and waits
until the write operation on the bitmap's buffer
terminates.
In this section we'll briefly describe the enhanced filesystem that has evolved from Ext2, named Ext3. The new filesystem has been designed with two simple concepts in mind:
To be a journaling filesystem (see the next section)
To be, as much as possible, compatible with the old Ext2 filesystem
Ext3 achieves both goals very well. In particular, it is largely based on Ext2, so its data structures on disk are essentially identical to those of an Ext2 filesystem. As a matter of fact, if an Ext3 filesystem has been cleanly unmounted, it can be remounted as an Ext2 filesystem; conversely, creating a journal of an Ext2 filesystem and remounting it as an Ext3 filesystem is a simple, fast operation.
Thanks to the compatibility between Ext3 and Ext2, most descriptions in the previous sections of this chapter apply to Ext3 as well. Therefore, in this section, we focus on the new feature offered by Ext3 — "the journal."
As disks became larger, one design choice of traditional Unix filesystems (such as Ext2) turned out to be inappropriate. As we know from Chapter 14, updates to filesystem blocks might be kept in dynamic memory for a long period of time before being flushed to disk. A dramatic event such as a power-down failure or a system crash might thus leave the filesystem in an inconsistent state. To overcome this problem, each traditional Unix filesystem is checked before being mounted; if it has not been properly unmounted, then a specific program executes an exhaustive, time-consuming check and fixes all the filesystem's data structures on disk.
For instance, the Ext2 filesystem status is stored in the
s_mount_state field of the
superblock on disk. The e2fsck
utility program is invoked by the boot script to check the value
stored in this field; if it is not equal to EXT2_VALID_FS, the filesystem was not
properly unmounted, and therefore e2fsck starts checking all disk data
structures of the filesystem.
Clearly, the time spent checking the consistency of a filesystem depends mainly on the number of files and directories to be examined; therefore, it also depends on the disk size. Nowadays, with filesystems reaching hundreds of gigabytes, a single consistency check may take hours. The involved downtime is unacceptable for every production environment or high-availability server.
The goal of a journaling filesystem is to avoid running time-consuming consistency checks on the whole filesystem by looking instead in a special disk area, named the journal, that contains the most recent disk write operations. Remounting a journaling filesystem after a system failure is a matter of a few seconds.
The idea behind Ext3 journaling is to perform each high-level change to the filesystem in two steps. First, a copy of the blocks to be written is stored in the journal; then, when the I/O data transfer to the journal is completed (in short, data is committed to the journal), the blocks are written in the filesystem. When the I/O data transfer to the filesystem terminates (data is committed to the filesystem), the copies of the blocks in the journal are discarded.
While recovering after a system failure, the e2fsck program distinguishes the following two cases:
Either the copies of the blocks relative to the high-level change are missing from the journal or they are incomplete; in both cases, e2fsck ignores them.
The copies of the blocks are valid, and e2fsck writes them into the filesystem.
In the first case, the high-level change to the filesystem is lost, but the filesystem state is still consistent. In the second case, e2fsck applies the whole high-level change, thus fixing every inconsistency due to unfinished I/O data transfers into the filesystem.
Don't expect too much from a journaling filesystem; it ensures consistency only at the system call level. For instance, a system failure that occurs while you are copying a large file by issuing several write( ) system calls will interrupt the copy operation; thus, the duplicated file will be shorter than the original one.
Furthermore, journaling filesystems do not usually copy all blocks into the journal. In fact, each filesystem consists of two kinds of blocks: those containing the so-called metadata and those containing regular data. In the case of Ext2 and Ext3, there are six kinds of metadata: superblocks, group block descriptors, inodes, blocks used for indirect addressing (indirection blocks), data bitmap blocks, and inode bitmap blocks. Other filesystems may use different metadata.
Several journaling filesystems, such as SGI's XFS and IBM's JFS, limit themselves to logging the operations affecting metadata. In fact, metadata's log records are sufficient to restore the consistency of the on-disk filesystem data structures. However, since operations on blocks of file data are not logged, nothing prevents a system failure from corrupting the contents of the files.
The Ext3 filesystem, however, can be configured to log the operations affecting both the filesystem metadata and the data blocks of the files. Because logging every kind of write operation leads to a significant performance penalty, Ext3 lets the system administrator decide what has to be logged; in particular, it offers three different journaling modes:
Journal
All filesystem data and metadata changes are logged into the journal. This mode minimizes the chance of losing the updates made to each file, but it requires many additional disk accesses. For example, when a new file is created, all its data blocks must be duplicated as log records. This is the safest and slowest Ext3 journaling mode.
Ordered
Only changes to filesystem metadata are logged into the journal. However, the Ext3 filesystem groups metadata and relative data blocks so that data blocks are written to disk before the metadata. This way, the chance of data corruption inside the files is reduced; for instance, each write access that enlarges a file is guaranteed to be fully protected by the journal. This is the default Ext3 journaling mode.
Writeback
Only changes to filesystem metadata are logged; this is the method found on other journaling filesystems and is the fastest mode.
The journaling mode of the Ext3 filesystem is specified by an option of the mount system command. For instance, to mount an Ext3 filesystem stored in the /dev/sda2 partition on the /jdisk mount point with the "writeback" mode, the system administrator can type the command:
# mount -t ext3 -o data=writeback /dev/sda2 /jdisk
The Ext3 journal is usually stored in a hidden file named .journal located in the root directory of the filesystem.
The Ext3 filesystem does not handle the journal on its own; rather, it uses a general kernel layer named Journaling Block Device, or JBD. Right now, only Ext3 uses the JBD layer, but other filesystems might use it in the future.
The JBD layer is a rather complex piece of software. The Ext3 filesystem invokes the JBD routines to ensure that its subsequent operations don't corrupt the disk data structures in case of system failure. However, JBD typically uses the same disk to log the changes performed by the Ext3 filesystem, and it is therefore vulnerable to system failures as much as Ext3. In other words, JBD must also protect itself from system failures that could corrupt the journal.
Therefore, the interaction between Ext3 and JBD is essentially based on three fundamental units:
Log record
Describes a single update of a disk block of the journaling filesystem.
Atomic operation handle
Includes log records relative to a single high-level change of the filesystem; typically, each system call modifying the filesystem gives rise to a single atomic operation handle.
Transaction
Includes several atomic operation handles whose log records are marked valid for e2fsck at the same time.
A log record is essentially the description of a low-level operation that is going to be issued by the filesystem. In some journaling filesystems, the log record consists of exactly the span of bytes modified by the operation, together with the starting position of the bytes inside the filesystem. The JBD layer, however, uses log records consisting of the whole buffer modified by the low-level operation. This approach may waste a lot of journal space (for instance, when the low-level operation just changes the value of a bit in a bitmap), but it is also much faster because the JBD layer can work directly with buffers and their buffer heads.
Log records are thus represented inside the journal as normal
blocks of data (or metadata). Each such block, however, is
associated with a small tag of type journal_block_tag_t, which stores the
logical block number of the block inside the filesystem and a few
status flags.
Later, whenever a buffer is being considered by the JBD,
either because it belongs to a log record or because it is a data
block that should be flushed to disk before the corresponding
metadata block (in the "ordered" journaling mode), the kernel
attaches a journal_head data
structure to the buffer head. In this case, the b_private field of the buffer head stores
the address of the journal_head
data structure and the BH_JBD
flag is set (see the section "Block Buffers and Buffer
Heads" in Chapter
15).
Every system call modifying the filesystem is usually split into a series of low-level operations that manipulate disk data structures.
For instance, suppose that Ext3 must satisfy a user request to append a block of data to a regular file. The filesystem layer must determine the last block of the file, locate a free block in the filesystem, update the data block bitmap inside the proper block group, store the logical number of the new block either in the file's inode or in an indirect addressing block, write the contents of the new block, and finally, update several fields of the inode. As you see, the append operation translates into many lower-level operations on the data and metadata blocks of the filesystem.
Now, just imagine what could happen if a system failure occurred in the middle of an append operation, when some of the lower-level manipulations have already been executed while others have not. Of course, the scenario could be even worse, with high-level operations affecting two or more files (for example, moving a file from one directory to another).
To prevent data corruption, the Ext3 filesystem must ensure that each system call is handled in an atomic way. An atomic operation handle is a set of low-level operations on the disk data structures that correspond to a single high-level operation. When recovering from a system failure, the filesystem ensures that either the whole high-level operation is applied or none of its low-level operations is.
Each atomic operation handle is represented by a descriptor of
type handle_t. To start an atomic
operation, the Ext3 filesystem invokes the journal_start( ) JBD function, which
allocates, if necessary, a new atomic operation handle and inserts
it into the current transaction (see the next section). Because
every low-level operation on the disk might suspend the process, the
address of the active handle is stored in the journal_info field of the process
descriptor. To notify that an atomic operation is completed, the
Ext3 filesystem invokes the journal_stop(
) function.
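In kernel code, the bracketing of a high-level operation looks roughly like the following kernel-context sketch. It is not runnable in user space; `journal_start( )` and `journal_stop( )` are the JBD functions named above, while `journal` and `needed_blocks` are hypothetical placeholders, and error handling is abbreviated:

```
/* Kernel-context sketch: bracketing one high-level Ext3 operation
 * with an atomic operation handle (placeholders: journal,
 * needed_blocks). */
handle_t *handle;

/* Reserve journal space; the second argument is an upper bound on
 * the number of buffers this operation may modify. */
handle = journal_start(journal, needed_blocks);
if (IS_ERR(handle))
        return PTR_ERR(handle);

/* ... perform the low-level operations, logging each buffer ... */

/* Close the handle: its log records can now be committed as part
 * of the current transaction. */
journal_stop(handle);
```

Between the two calls, the address of the handle can also be reached through the `journal_info` field of the process descriptor, as noted above.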
For reasons of efficiency, the JBD layer manages the journal by grouping the log records that belong to several atomic operation handles into a single transaction. Furthermore, all log records relative to a handle must be included in the same transaction.
All log records of a transaction are stored in consecutive blocks of the journal. The JBD layer handles each transaction as a whole. For instance, it reclaims the blocks used by a transaction only after all data included in its log records is committed to the filesystem.
As soon as it is created, a transaction may accept log records of new handles. The transaction stops accepting new handles when either of the following occurs:
A fixed amount of time has elapsed, typically 5 seconds.
There are no free blocks in the journal left for a new handle.
A transaction is represented by a descriptor of type transaction_t. The most important field is
t_state, which describes the
current status of the transaction.
Essentially, a transaction can be:
Complete: All log records included in the transaction have been physically written onto the journal. When recovering from a system failure, e2fsck considers every complete transaction of the journal and writes the corresponding blocks into the filesystem. In this case, the t_state field stores the value T_FINISHED.
Incomplete: At least one log record included in the transaction has not yet been physically written to the journal, or new log records are still being added to the transaction. In case of system failure, the image of the transaction stored in the journal is likely not up-to-date. Therefore, when recovering from a system failure, e2fsck does not trust the incomplete transactions in the journal and skips them. In this case, the t_state field stores one of the following values:
T_RUNNING: Still accepting new atomic operation handles.
T_LOCKED: Not accepting new atomic operation handles, but some of them are still unfinished.
T_FLUSH: All atomic operation handles have finished, but some log records are still being written to the journal.
T_COMMIT: All log records of the atomic operation handles have been written to disk, but the transaction has yet to be marked as completed on the journal.
At any time the journal may include several transactions, but
only one of them is in the T_RUNNING state — it is the
active transaction that is accepting the new
atomic operation handle requests issued by the Ext3
filesystem.
Several transactions in the journal might be incomplete, because the buffers containing the relative log records have not yet been written to the journal.
If a transaction is complete, all its log records have been written to the journal but some of the corresponding buffers have yet to be written onto the filesystem. A complete transaction is deleted from the journal when the JBD layer verifies that all buffers described by the log records have been successfully written onto the Ext3 filesystem.
Let's try to explain how journaling works with an example: the Ext3 filesystem layer receives a request to write some data blocks of a regular file.
As you might easily guess, we are not going to describe in detail every single operation of the Ext3 filesystem layer and of the JBD layer. There would be far too many issues to be covered! However, we describe the essential actions:
The service routine of the write(
) system call triggers the write method of the file object
associated with the Ext3 regular file. For Ext3, this method is
implemented by the generic_file_write(
) function, already described in the section "Writing to a File"
in Chapter 16.
The generic_file_write( )
function invokes the prepare_write method of the address_space object several times, once
for every page of data involved by the write operation. For Ext3,
this method is implemented by the ext3_prepare_write( ) function.
The ext3_prepare_write( )
function starts a new atomic operation by invoking the journal_start( ) JBD function. The
handle is added to the active transaction. Actually, the atomic
operation handle is created only when executing the first
invocation of the journal_start(
) function. Following invocations verify that the
journal_info field of the
process descriptor is already set and use the referenced
handle.
The ext3_prepare_write( )
function invokes the block_prepare_write(
) function already described in Chapter 16, passing to it the
address of the ext3_get_block(
) function. Remember that block_prepare_write( ) takes care of
preparing the buffers and the buffer heads of the file's
page.
When the kernel must determine the logical number of a block
of the Ext3 filesystem, it executes the ext3_get_block( ) function. This
function is actually similar to ext2_get_block( ), which is described in
the earlier section "Allocating a Data
Block." A crucial difference, however, is that the Ext3
filesystem invokes functions of the JBD layer to ensure that the
low-level operations are logged:
Before issuing a low-level write
operation on a metadata block of the filesystem, the function invokes
journal_get_write_access(
). Basically, this latter function adds the metadata
buffer to a list of the active transaction. However, it must
also check whether the metadata is included in an older
incomplete transaction of the journal; in this case, it
duplicates the buffer to make sure that the older transactions
are committed with the old content.
After updating the buffer
containing the metadata block, the Ext3 filesystem invokes
journal_dirty_metadata( )
to move the metadata buffer to the proper dirty list of the
active transaction and to log the operation in the
journal.
Notice that metadata buffers handled by the JBD layer are not usually included in the dirty lists of buffers of the inode, so they are not written to disk by the normal disk cache flushing mechanisms described in Chapter 15.
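The two-step logging pattern for a metadata buffer can be summarized in a kernel-context sketch. `journal_get_write_access( )` and `journal_dirty_metadata( )` are the JBD functions named above; `handle` and `bh` are placeholders for the active handle and the buffer head, and error handling is abbreviated:

```
/* Kernel-context sketch: logging one metadata update
 * (placeholders: handle, bh). */
err = journal_get_write_access(handle, bh);
if (err)
        goto fail;      /* cannot safely modify the buffer */

/* ... modify the metadata block held in bh->b_data ... */

/* Move the buffer to the active transaction's dirty list and
 * log the operation in the journal. */
err = journal_dirty_metadata(handle, bh);
```

Declaring the write access before the modification gives the JBD layer the chance to copy the old contents if an older, incomplete transaction still needs them.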
If the Ext3 filesystem has been mounted in "journal" mode,
the ext3_prepare_write( )
function also invokes journal_get_write_access( ) on every
buffer touched by the write operation.
Control returns to the generic_file_write( ) function, which
updates the page with the data stored in the User Mode address
space and then invokes the commit_write method of the address_space object. For Ext3, the
function that implements this method depends on how the Ext3
filesystem has been mounted:
If the Ext3 filesystem has been mounted in "journal"
mode, the commit_write
method is implemented by the ext3_journalled_commit_write( )
function, which invokes journal_dirty_metadata( ) on every
buffer of data (not metadata) in the page. This way, the
buffer is included in the proper dirty list of the active
transaction and not in the dirty list of the owner inode;
moreover, the corresponding log records are written to the
journal. Finally, ext3_journalled_commit_write( )
invokes journal_stop( ) to
notify the JBD layer that the atomic operation handle is
closed.
If the Ext3 filesystem has been mounted in "ordered"
mode, the commit_write
method is implemented by the ext3_ordered_commit_write( )
function, which invokes the journal_dirty_data( ) function on
every buffer of data in the page to insert the buffer in a
proper list of the active transactions. The JBD layer ensures
that all buffers in this list are written to disk before the
metadata buffers of the transaction. No log record is written
onto the journal. Next, ext3_ordered_commit_write( )
executes the normal generic_commit_write( ) function
described in Chapter
15, which inserts the data buffers in the list of the
dirty buffers of the owner inode. Finally, ext3_ordered_commit_write( ) invokes
journal_stop( ) to notify
the JBD layer that the atomic operation handle is
closed.
If the Ext3 filesystem has been mounted in "writeback"
mode, the commit_write
method is implemented by the ext3_writeback_commit_write( )
function, which executes the normal generic_commit_write( ) function
described in Chapter
15, which inserts the data buffers in the list of the
dirty buffers of the owner inode. Then, ext3_writeback_commit_write( )
invokes journal_stop( ) to
notify the JBD layer that the atomic operation handle is
closed.
The service routine of the write(
) system call terminates here. However, the JBD layer
has not finished its work. Eventually, our transaction becomes
complete when all its log records have been physically written to
the journal. Then journal_commit_transaction( ) is
executed.
If the Ext3 filesystem has been mounted in "ordered" mode,
the journal_commit_transaction(
) function activates the I/O data transfers for all data
buffers included in the list of the transaction and waits until
all data transfers terminate.
The journal_commit_transaction(
) function activates the I/O data transfers for all
metadata buffers included in the transaction (and also for all
data buffers, if Ext3 was mounted in "journal" mode).
Periodically, the kernel activates a checkpoint activity for
every complete transaction in the journal. The checkpoint
basically involves verifying whether the I/O data transfers
triggered by journal_commit_transaction(
) have successfully terminated. If so, the transaction
can be deleted from the journal.
Of course, the log records in the journal never play an active role until a system failure occurs. Only during system reboot does the e2fsck utility program scan the journal stored in the filesystem and reschedule all write operations described by the log records of the complete transactions.
This chapter explains how User Mode processes can synchronize their actions and exchange data. We already covered several synchronization topics in Chapter 5, but the actors there were kernel control paths, not User Mode programs. We are now ready, after having discussed I/O management and filesystems at length, to extend the discussion to User Mode processes. These processes must rely on the kernel to facilitate interprocess synchronization and communication.
As we saw in the section "Linux File Locking" in Chapter 12, a form of synchronization among User Mode processes can be achieved by creating a (possibly empty) file and using suitable VFS system calls to lock and unlock it. While processes can similarly share data via temporary files protected by locks, this approach is costly because it requires accesses to the filesystem on disk. For this reason, all Unix kernels include a set of system calls that supports process communication without interacting with the filesystem; furthermore, several wrapper functions were developed and inserted in suitable libraries to expedite how processes issue their synchronization requests to the kernel.
As usual, application programmers have a variety of needs that call for different communication mechanisms. Here are the basic mechanisms that Unix systems offer to allow interprocess communication:
Pipes and FIFOs: Best suited to implement producer/consumer interactions among processes. Some processes fill the pipe with data, while others extract data from the pipe. They are covered in the sections "Pipes" and "FIFOs."
Semaphores: Represent, as the name implies, the User Mode version of the kernel semaphores discussed in the section "Semaphores" in Chapter 5. They are described in the section "System V IPC."
Messages: Allow processes to exchange messages (short blocks of data) by reading and writing them in predefined message queues. The Linux kernel offers two different versions of messages: System V IPC messages (covered in the section "System V IPC") and POSIX messages (described in the section "POSIX Message Queues").
Shared memory regions: Allow processes to exchange information via a shared block of memory. In applications that must share large amounts of data, this can be the most efficient form of process communication. They are described in the section "System V IPC."
Sockets: Allow processes on different computers to exchange data through a network. Sockets can also be used as a communication tool for processes located on the same host computer; the X Window System graphic interface, for instance, uses a socket to allow client programs to exchange data with the X server.
Pipes are an interprocess communication mechanism that is provided in all flavors of Unix. A pipe is a one-way flow of data between processes: all data written by a process to the pipe is routed by the kernel to another process, which can thus read it.
In Unix command shells, pipes can be created by means of the | operator. For instance, the following
statement instructs the shell to create two processes connected by a
pipe:
$ ls | more
The standard output of the first process, which executes the ls program, is redirected to the pipe; the second process, which executes the more program, reads its input from the pipe.
Note that the same results can also be obtained by issuing two commands such as the following:
$ ls > temp
$ more < temp
The first command redirects the output of ls into a regular file; then the second command forces more to read its input from the same file. Of course, using pipes instead of temporary files is usually more convenient for the following reasons:
The shell statement is much shorter and simpler.
There is no need to create temporary regular files, which must be deleted later.
Pipes may be considered open files that have no
corresponding image in the mounted filesystems. A process creates a
new pipe by means of the pipe( )
system call, which returns a pair of file descriptors ; the process may then pass these descriptors to its
descendants through fork( )
, thus sharing the pipe with them. The processes can
read from the pipe by using the read(
) system call with the first file descriptor; likewise, they
can write into the pipe by using the write(
) system call with the second file descriptor.
POSIX defines only half-duplex pipes
, so even though the pipe(
) system call returns two file descriptors, each process
must close one before using the other. If a two-way flow of data is
required, the processes must use two different pipes by invoking
pipe( ) twice.
Several Unix systems, such as System V Release 4, implement full-duplex pipes . In a full-duplex pipe, both descriptors can be written into and read from, thus there are two bidirectional channels of information. Linux adopts yet another approach: each pipe's file descriptors are still one-way, but it is not necessary to close one of them before using the other.
Let's resume the previous example. When the command shell
interprets the ls|more statement,
it essentially performs the following actions:
Invokes the pipe( )
system call; let's assume that pipe(
) returns the file descriptors 3 (the pipe's
read channel) and 4 (the write
channel).
Invokes the fork( )
system call twice.
Invokes the close( )
system call twice to release file descriptors 3 and 4.
The first child process, which must execute the ls program, performs the following operations:
Invokes dup2(4,1) to copy
file descriptor 4 to file descriptor 1. From now on, file
descriptor 1 refers to the pipe's write channel.
Invokes the close( )
system call twice to release file descriptors 3 and 4.
Invokes the execve( )
system call to execute the ls
program (see the section "The exec Functions" in
Chapter 20). The
program writes its output to the file that has file descriptor 1
(the standard output); i.e., it writes into the pipe.
The second child process must execute the more program; therefore, it performs the following operations:
Invokes dup2(3,0) to copy
file descriptor 3 to file descriptor 0. From now on, file
descriptor 0 refers to the pipe's read channel.
Invokes the close( )
system call twice to release file descriptors 3 and 4.
Invokes the execve( )
system call to execute more.
By default, that program reads its input from the file that has
file descriptor 0 (the standard input); i.e., it reads from the
pipe.
In this simple example, the pipe is used by exactly two processes. Because of its implementation, though, a pipe can be used by an arbitrary number of processes.[*] Clearly, if two or more processes read or write the same pipe, they must explicitly synchronize their accesses by using file locking (see the section "Linux File Locking" in Chapter 12) or IPC semaphores (see the section "IPC Semaphores" later in this chapter).
Many Unix systems provide, besides the pipe( ) system call, two wrapper functions
named popen( ) and pclose( ) that handle all the dirty work
usually done when using pipes. Once a pipe has been created by means
of the popen( ) function, it can be
used with the high-level I/O functions included in the C library
(fprintf( ), fscanf( ), and so on).
In Linux, popen( ) and
pclose( ) are included in the C
library. The popen( ) function
receives two parameters: the filename pathname of an executable file and
a type string specifying the
direction of the data transfer. It returns the pointer to a FILE data structure. The popen( ) function essentially performs the
following operations:
Creates a new pipe by using the pipe( ) system call.
Forks a new process, which in turn executes the following operations:
If type is r, it duplicates the file descriptor
associated with the pipe's write channel as file descriptor 1
(standard output); otherwise, if type is w, it duplicates the file descriptor
associated with the pipe's read channel as file descriptor 0
(standard input).
Closes the file descriptors returned by pipe( ).
Invokes the execve( )
system call to execute the program specified by filename.
If type is r, it closes the file descriptor
associated with the pipe's write channel; otherwise, if type is w, it closes the file descriptor
associated with the pipe's read channel.
Returns the address of the FILE file pointer that refers to
whichever file descriptor for the pipe is still open.
After the popen( )
invocation, parent and child can exchange information through the
pipe: the parent can read (if type
is r) or write (if type is w) data by using the FILE pointer returned by the function. The
data is written to the standard output or read from the standard
input, respectively, by the program executed by the child
process.
The pclose( ) function (which
receives the file pointer returned by popen(
) as its parameter) simply invokes the wait4( ) system call and waits for the
termination of the process created by popen(
).
We now have to start thinking again at the system call
level. Once a pipe is created, a process uses the read( ) and write(
) VFS system calls to access it. Therefore, for each pipe,
the kernel creates an inode object plus two file objects—one for
reading and the other for writing. When a process wants to read from
or write to the pipe, it must use the proper file descriptor.
When the inode object refers to a pipe, its i_pipe field points to a pipe_inode_info structure shown in Table 19-1.
Table 19-1. The pipe_inode_info structure
| Type | Field | Description |
|---|---|---|
| wait_queue_head_t | wait | Pipe/FIFO wait queue |
| unsigned int | nrbufs | Number of buffers containing data to be read |
| unsigned int | curbuf | Index of first buffer containing data to be read |
| struct pipe_buffer [16] | bufs | Array of pipe's buffer descriptors |
| struct page * | tmp_page | Pointer to a cached page frame |
| unsigned int | start | Read position in current pipe buffer |
| unsigned int | readers | Flag for (or number of) reading processes |
| unsigned int | writers | Flag for (or number of) writing processes |
| unsigned int | waiting_writers | Number of writing processes sleeping in the wait queue |
| unsigned int | r_counter | Like readers, but used when waiting for a process reading from the FIFO |
| unsigned int | w_counter | Like writers, but used when waiting for a process writing into the FIFO |
| struct fasync_struct * | fasync_readers | Used for asynchronous I/O notification via signals |
| struct fasync_struct * | fasync_writers | Used for asynchronous I/O notification via signals |
Besides one inode and two file objects, each pipe has its own set of pipe buffers . Essentially, a pipe buffer is a page frame that contains data written into the pipe and yet to be read. Up to Linux 2.6.10, each pipe had just one pipe buffer. In the 2.6.11 kernel, however, data buffering for pipes (and FIFOs) has been heavily revised, and now each pipe makes use of 16 pipe buffers. This change greatly enhances the performance of User Mode applications that write large chunks of data in a pipe.
The bufs field of the
pipe_inode_info data structure
stores an array of 16 pipe_buffer
objects, each of which describes a pipe buffer. The fields of this
object are shown in Table
19-2.
Table 19-2. The fields of the pipe_buffer object
| Type | Field | Description |
|---|---|---|
| struct page * | page | Address of the descriptor of the page frame for the pipe buffer |
| unsigned int | offset | Current position of the significant data inside the page frame |
| unsigned int | len | Length of the significant data in the pipe buffer |
| struct pipe_buf_operations * | ops | Address of a table of methods relative to the pipe buffer (NULL if the pipe buffer is empty) |
The ops field points to the
anon_pipe_buf_ops table of the pipe
buffer's methods, which is a data structure of type pipe_buf_operations. Essentially, the table
includes three methods:
map: Invoked before accessing data in the pipe buffer. It simply invokes kmap( ) on the pipe buffer's page frame, just in case the pipe buffer is stored in high memory (see the section "Kernel Mappings of High-Memory Page Frames" in Chapter 8).
unmap: Invoked when no longer accessing data in the pipe buffer. It invokes kunmap( ) on the pipe buffer's page frame.
release: Invoked when a pipe buffer is being released. The method implements a one-page memory cache: the page frame released is not the one storing the buffer, but a cached page frame pointed to by the tmp_page field of the pipe_inode_info data structure (if not NULL). The page frame that stored the buffer becomes the new cached page frame.
The 16 pipe buffers can be seen as a global, circular buffer: writing processes keep adding data to this large buffer, while reading processes keep removing it. The number of bytes currently written in all pipe buffers and yet to be read is the so-called pipe size. For reasons of efficiency, the data yet to be read can be spread among several partially filled pipe buffers: in fact, each write operation may copy the data into a fresh, empty pipe buffer if the previous pipe buffer does not have enough free space to store the new data. Hence, the kernel must keep track of:
The pipe buffer that includes the next byte to be read, and
the corresponding offset inside the page frame. The index of this
pipe buffer is stored in the curbuf field of the pipe_inode_info data structure, while
the offset is stored in the offset field of the corresponding
pipe_buffer object.
The first empty pipe buffer. Its index can be computed by
adding (modulo 16) the index of the current pipe buffer, which is
stored in the curbuf field of
the pipe_inode_info data
structure, and the number of pipe buffers with significant data,
which is stored in the nrbufs
field.
To avoid race conditions on the pipe's data structures, the
kernel makes use of the i_sem
semaphore included in the inode object.
A pipe is implemented as a set of VFS objects, which have no corresponding disk images. In Linux 2.6, these VFS objects are organized into the pipefs special filesystem to expedite their handling (see the section "Special Filesystems" in Chapter 12). Because this filesystem has no mount point in the system directory tree, users never see it. However, thanks to pipefs, the pipes are fully integrated in the VFS layer, and the kernel can handle them in the same way as named pipes or FIFOs, which truly exist as files recognizable to end users (see the later section "FIFOs").
The init_pipe_fs( )
function, typically executed during kernel initialization, registers
the pipefs filesystem and mounts it (refer to
the discussion in the section "Mounting a Generic
Filesystem" in Chapter
12):
struct file_system_type pipe_fs_type;
pipe_fs_type.name = "pipefs";
pipe_fs_type.get_sb = pipefs_get_sb;
pipe_fs_type.kill_sb = kill_anon_super;
register_filesystem(&pipe_fs_type);
pipe_mnt = do_kern_mount("pipefs", 0, "pipefs", NULL);
The mounted filesystem object that represents the root
directory of pipefs is stored in the pipe_mnt variable.
The pipe( ) system
call is serviced by the sys_pipe( )
function, which in turn invokes the do_pipe(
) function. To create a new pipe, do_pipe( ) performs the following
operations:
Invokes the get_pipe_inode(
) function, which allocates and initializes an inode
object for the pipe in the pipefs filesystem.
In particular, this function executes the following
actions:
Allocates a new inode in the pipefs filesystem.
Allocates a pipe_inode_info data structure and
stores its address in the i_pipe field of the inode.
Sets the curbuf and
nrbufs fields of the
pipe_inode_info structure
to 0; also, fills with zeros all fields of the pipe buffer
objects in the bufs
array.
Initializes the r_counter and w_counter fields of the pipe_inode_info structure to
1.
Sets the readers and
writers fields of the
pipe_inode_info structure
to 1.
Allocates a file object and a file descriptor for the read
channel of the pipe, sets the f_flag field of the file object to
O_RDONLY, and initializes the
f_op field with the address of
the read_ pipe_fops
table.
Allocates a file object and a file descriptor for the write
channel of the pipe, sets the flag field of the file object to
O_WRONLY, and initializes the
f_op field with the address of
the write_ pipe_fops
table.
Allocates a dentry object and uses it to link the two file objects and the inode object (see the section "The Common File Model" in Chapter 12); then inserts the new inode in the pipefs special filesystem.
The process that issues a pipe(
) system call is initially the only process that can access
the new pipe, both for reading and writing. To represent that the pipe
has both a reader and a writer, the readers and writers fields of the pipe_inode_info data structure are
initialized to 1. In general, each of these two fields is set to 1
only if the corresponding pipe's file object is still opened by a
process; the field is set to 0 if the corresponding file object has
been released, because it is no longer accessed by any process.
Forking a new process does not increase the value of the
readers and writers fields, so they never rise above
1;[*] however, it does increase the value of the usage
counters of all file objects still used by the parent process (see the
section "The clone( ),
fork( ), and vfork( ) System Calls" in Chapter 3). Thus, the objects are
not released even when the parent dies, and the pipe stays open for
use by the children.
Whenever a process invokes the close(
) system call on a file descriptor associated with a pipe,
the kernel executes the fput( )
function on the corresponding file object, which decreases the usage
counter. If the counter becomes 0, the function invokes the release method of the file operations (see
the sections "The close(
) System Call" and "Files Associated with a
Process" in Chapter
12).
Depending on whether the file is associated with the read or
write channel, the release method
is implemented by either pipe_read_release(
) or pipe_write_release(
); both functions invoke pipe_release( ), which sets either the
readers field or the writers field of the pipe_inode_info structure to 0. The function
checks whether both the readers and
writers fields are equal to 0; in
this case, it invokes the pipe buffer's release method of all pipe buffers, thus
releasing to the buddy system all pipe's page frames; moreover, the
function releases the cached page frame pointed to by the tmp_page field. Otherwise, if either the
readers field or the writers field is not zero, the function
wakes up the processes sleeping in the pipe's wait queue so they can
recognize the change in the pipe state.
A process wishing to get data from a pipe issues a
read( ) system call, specifying the
file descriptor associated with the pipe's reading end. As described
in the section "The read(
) and write( ) System Calls" in Chapter 12, the kernel ends up
invoking the read method found in
the file operation table associated with the proper file object. In
the case of a pipe, the entry for the read method in the read_pipe_fops table points to the pipe_read( ) function.
The pipe_read( ) function is
quite involved, because the POSIX standard specifies several
requirements for the pipe's read operations. Table 19-3 summarizes the
expected behavior of a read( )
system call that requests n bytes from a pipe
that has a pipe size (number of bytes in the pipe buffers yet to be
read) equal to p.
The system call might block the current process in two cases:
The pipe buffer is empty when the system call starts.
The pipe buffer does not include all requested bytes, and a writing process was previously put to sleep while waiting for space in the buffer.
Notice that the read operation can be nonblocking: in this case, it completes as soon as all available bytes (even none) are copied into the user address space.[*]
Notice also that the value 0 is returned by the read( ) system call only if the pipe is
empty and no process is currently using the file object associated
with the pipe's write channel.
Table 19-3. Reading n bytes from a pipe

(The first three action columns assume at least one writing process; in the last two rows the behavior is the same in every case.)

| Pipe size p | Blocking read, sleeping writer | Blocking read, no sleeping writer | Nonblocking read | No writing process |
|---|---|---|---|---|
| p = 0 | Copy n bytes and return n, waiting for data when the pipe buffer is empty. | Wait for some data, copy it, and return its size. | Return -EAGAIN. | Return 0. |
| 0 < p < n | Copy p bytes and return p: 0 bytes are left in the pipe buffer (all cases). | | | |
| p ≥ n | Copy n bytes and return n: p-n bytes are left in the pipe buffer (all cases). | | | |
The function performs the following operations:
Acquires the i_sem
semaphore of the inode.
Determines whether the pipe size is 0 by reading the
nrbufs field of the pipe_inode_info structure; if the field
is equal to zero, all pipe buffers are empty. In this case, it
determines whether the function must return or whether the process
must be blocked while waiting until another process writes some
data in the pipe (see Table 19-3). The type
of I/O operation (blocking or nonblocking) is specified by the
O_NONBLOCK flag in the f_flags field of the file object. If the
current process must be blocked, the function performs the
following actions:
Invokes prepare_to_wait(
) to add current
to the wait queue of the pipe (the wait field of the pipe_inode_info structure).
Releases the inode semaphore.
Invokes schedule(
).
Once awake, invokes finish_wait( ) to remove current from the wait queue,
acquires again the i_sem
inode semaphore, and then jumps back to step 2.
Gets the index of the current pipe buffer from the curbuf field of the pipe_inode_info data structure.
Executes the map method
of the pipe buffer.
Copies the requested number of bytes—or the number of available bytes in the pipe buffer, if it is smaller—from the pipe's buffer to the user address space.
Executes the unmap method
of the pipe buffer.
Updates the offset and
len fields of the corresponding
pipe_buffer object.
If the pipe buffer has been emptied (the len field of the pipe_buffer object is now equal to zero),
it invokes the pipe buffer's release method to free the corresponding
page frame, sets the ops field
in the pipe_buffer object to
NULL, advances the index of the
current pipe buffer stored in the curbuf field of the pipe_inode_info data structure, and
decreases the counter of nonempty pipe buffers in the nrbufs field.
If all requested bytes have been copied, it jumps to step 12.
Here not all requested bytes have been copied to the User
Mode address space. If the pipe size is greater than zero
(nrbufs field of the pipe_inode_info data structure not
null), it goes back to step 3.
There are no more bytes left in the pipe buffers. If there
is at least one writing process currently sleeping (that is, the
waiting_writers field of the
pipe_inode_info data structure
is greater than 0), and the read operation is blocking, it invokes
wake_up_interruptible_sync( )
to wake up all processes sleeping on the pipe's wait queue, and
jumps back to step 2.
Releases the i_sem
semaphore of the inode.
Invokes wake_up_interruptible_sync(
) to wake up all writer processes sleeping on the pipe's
wait queue.
Returns the number of bytes copied into the user address space.
A process wishing to put data into a pipe issues a
write( ) system call, specifying
the file descriptor for the writing end of the pipe. The kernel
satisfies this request by invoking the write method of the proper file object; the
corresponding entry in the write_pipe_fops table points to the pipe_write( ) function.
Table 19-4
summarizes the behavior, specified by the POSIX standard, of a
write( ) system call that requested
to write n bytes into a pipe having
u unused bytes in its buffer. In particular, the
standard requires that write operations involving a small number of
bytes must be atomically executed. More precisely, if two or more
processes are concurrently writing into a pipe, each write operation involving fewer than 4,096
bytes (the pipe buffer size) must finish without being interleaved
with write operations of other processes to the same pipe. However,
write operations involving more than 4,096 bytes may be nonatomic and
may also force the calling process to sleep.
Table 19-4. Writing n bytes to a pipe

(The first two action columns assume at least one reading process.)

| Available buffer space u | Blocking write | Nonblocking write | No reading process |
|---|---|---|---|
| u < n ≤ 4,096 | Wait until n-u bytes are freed, copy n bytes, and return n. | Return -EAGAIN. | Send SIGPIPE and return -EPIPE. |
| n > 4,096 | Copy n bytes (waiting when necessary) and return n. | If u > 0, copy u bytes and return u; otherwise, return -EAGAIN. | Send SIGPIPE and return -EPIPE. |
| u ≥ n | Copy n bytes and return n. | Copy n bytes and return n. | Send SIGPIPE and return -EPIPE. |
Moreover, each write operation to a pipe must fail if the pipe
does not have a reading process (that is, if the readers field of the pipe's inode object has
the value 0). In this case, the kernel sends a SIGPIPE signal to the writing process and
terminates the write( ) system call
with the -EPIPE error code, which
usually leads to the familiar "Broken pipe" message.
The pipe_write( ) function
performs the following operations:
Acquires the i_sem
semaphore of the inode.
Checks whether the pipe has at least one reading process. If
not, it sends a SIGPIPE signal
to the current process,
releases the inode semaphore, and returns an -EPIPE value.
Determines the index of the last written pipe buffer by
adding the curbuf and nrbufs fields of the pipe_inode_info data structure and
subtracting 1. If this pipe buffer has enough free space to store
all the bytes to be written, then it copies the data into
it:
Executes the map
method of the pipe buffer.
Copies all the bytes in the pipe buffer.
Executes the unmap
method of the pipe buffer.
Updates the len field
of the corresponding pipe_buffer object.
Jumps to step 11.
If the nrbufs field of
the pipe_inode_info data
structure is equal to 16, there is no empty pipe buffer to store
the bytes (yet) to be written. In this case:
If the write operation is nonblocking, it jumps to step
11 to terminate by returning the -EAGAIN error code.
If the write operation is blocking, it adds 1 to the
waiting_writers field of
the pipe_inode_info
structure, invokes prepare_to_wait(
) to add current
to the wait queue of the pipe (the wait field of the pipe_inode_info structure), releases
the inode semaphore, and invokes schedule( ). Once awake, it invokes
finish_wait( ) to remove
current from the wait
queue, again acquires the inode semaphore, decreases the
waiting_writers field, and
then jumps back to step 4.
Now there is at least one empty pipe buffer. Determines the
index of the first empty pipe buffer by adding the curbuf and nrbufs fields of the pipe_inode_info data structure.
Allocates a new page frame from the buddy system, unless the
tmp_page field of the pipe_inode_info data structure is not
NULL.
Copies up to 4,096 bytes from the User Mode address space into the page frame (temporarily mapping it in the Kernel Mode linear address space, if necessary).
Updates the fields of the pipe_buffer object associated with the
pipe buffer by setting the page field to the address of the page
frame descriptor, the ops field
to the address of the anon_pipe_buf_ops table, the offset field to 0, and the len field to the number of written
bytes.
Increases the counter of nonempty pipe buffers stored in the
nrbufs field of the pipe_inode_info data structure.
If not all requested bytes were written, it jumps back to step 4.
Releases the inode semaphore.
Wakes up all reader processes sleeping on the pipe's wait queue.
Returns the number of bytes written into the pipe's buffer (or an error code if writing was not possible).
[*] Because most shells offer pipes that connect only two processes, applications requiring pipes used by more than two processes must be coded in a programming language such as C.
[*] As we'll see, the readers
and writers fields act as
counters instead of flags when associated with FIFOs.
[*] Nonblocking operations are usually requested by specifying
the O_NONBLOCK flag in the
open( ) system call. This
method does not work for pipes, because they cannot be opened. A
process can, however, require a nonblocking operation on a pipe by
issuing a fcntl( ) system call
on the corresponding file descriptor.
Although pipes are a simple, flexible, and efficient communication mechanism, they have one main drawback—namely, that there is no way to open an already existing pipe. This makes it impossible for two arbitrary processes to share the same pipe, unless the pipe was created by a common ancestor process.
This drawback is substantial for many application programs. Consider, for instance, a database engine server, which continuously polls client processes wishing to issue some queries and which sends the results of the database lookups back to them. Each interaction between the server and a given client might be handled by a pipe. However, client processes are usually created on demand by a command shell when a user explicitly queries the database; server and client processes thus cannot easily share a pipe.
To address such limitations, Unix systems introduce a special file type called a named pipe or FIFO (which stands for "first in, first out;" the first byte written into the special file is also the first byte that is read). Each FIFO is much like a pipe: rather than owning disk blocks in the filesystems, an opened FIFO is associated with a kernel buffer that temporarily stores the data exchanged by two or more processes.
Thanks to the disk inode, however, a FIFO can be accessed by every process, because the FIFO filename is included in the system's directory tree. Thus, in our example, the communication between server and clients may be easily established by using FIFOs instead of pipes. The server creates, at startup, a FIFO used by client programs to make their requests. Each client program creates, before establishing the connection, another FIFO to which the server program can write the answer to the query and includes the FIFO's name in the initial request to the server.
In Linux 2.6, FIFOs and pipes are almost identical and use the
same pipe_inode_info structures. As a
matter of fact, the read and write file operation methods of a FIFO are
implemented by the same pipe_read( )
and pipe_write( ) functions described
in the earlier sections "Reading from a Pipe" and
"Writing into a
Pipe." Actually, there are only two significant
differences:

FIFO inodes appear on the system directory tree rather than on the pipefs special filesystem.

FIFOs are a bidirectional communication channel; that is, it is possible to open a FIFO in both read and write mode.
To complete our description, therefore, we just have to explain how FIFOs are created and opened.
A process creates a FIFO by issuing a mknod( )[*] system call (see the section "Device Files" in Chapter 13), passing to it as
parameters the pathname of the new FIFO and the value S_IFIFO (0x1000) logically ORed with the permission
bit mask of the new file. POSIX introduces a function named mkfifo( ) specifically to create a FIFO.
This call is implemented in Linux, as in System V Release 4, as a C library function that invokes
mknod( ).
Once created, a FIFO can be accessed through the usual open( ), read(
), write( ), and close( ) system calls, but the VFS handles
it in a special way, because the FIFO inode and file operations are
customized and do not depend on the filesystems in which the FIFO is
stored.
The POSIX standard specifies the behavior of the open( ) system call on FIFOs; the behavior
depends essentially on the requested access type, the kind of I/O
operation (blocking or nonblocking), and the presence of other
processes accessing the FIFO.
A process may open a FIFO for reading, for writing, or for reading and writing. The file operations associated with the corresponding file object are set to special methods for these three cases.
When a process opens a FIFO, the VFS performs the same
operations as it does for device files (see the section "VFS Handling of Device
Files" in Chapter
13). The inode object associated with the opened FIFO is
initialized by a filesystem-dependent read_inode superblock method; this method
always checks whether the inode on disk represents a special file, and
invokes, if necessary, the init_special_inode( ) function. In turn,
this function sets the i_fop field
of the inode object to the address of the def_fifo_fops table. Later, the kernel sets
the file operation table of the file object to def_fifo_fops, and executes its open method, which is implemented by
fifo_open( ).
The fifo_open( ) function
initializes the data structures specific to the FIFO; in particular,
it performs the following operations:
Acquires the i_sem inode
semaphore.
Checks the i_pipe field
of the inode object; if it is NULL, it allocates and it initializes a
new pipe_inode_info structure,
as in steps 1b-1e in the earlier section "Creating and Destroying a
Pipe."
Depending on the access mode specified as the parameter of
the open( ) system call, it initializes the f_op field of the file object with the
address of the proper file operation table (see Table 19-5).
If the access mode is either read-only or read/write, it
adds one to the readers and
r_counter fields of the
pipe_inode_info structure.
Moreover, if the access mode is read-only and there is no other
reading process, it wakes up any writing process sleeping in the
wait queue.
If the access mode is either write-only or read/write, it
adds one to the writers and
w_counter fields of the
pipe_inode_info structure.
Moreover, if the access mode is write-only and there is no other
writing process, it wakes up any reading process sleeping in the
wait queue.
If there are no readers or no writers, it decides whether the function should block or terminate returning an error code (see Table 19-6).
Table 19-6. Behavior of the fifo_open( ) function
Access type | Blocking | Nonblocking |
|---|---|---|
Read-only, with writers | Successfully return | Successfully return |
Read-only, no writer | Wait for a writer | Successfully return |
Write-only, with readers | Successfully return | Successfully return |
Write-only, no reader | Wait for a reader | Return -ENXIO |
Read/write | Successfully return | Successfully return |
Releases the inode semaphore, and terminates, returning 0 (success).
The FIFO's three specialized file operation tables differ mainly
in the implementation of the read
and write methods. If the access
type allows read operations, the read method is implemented by the pipe_read( ) function. Otherwise, it is
implemented by bad_pipe_r( ), which
only returns an error code. Similarly, if the access type allows write
operations, the write method is
implemented by the pipe_write( )
function; otherwise, it is implemented by bad_pipe_w( ), which also returns an error
code.
IPC is an abbreviation for Interprocess Communication and commonly refers to a set of mechanisms that allow a User Mode process to do the following:
Synchronize itself with other processes by means of semaphores
Send messages to other processes or receive messages from them
Share a memory area with other processes
System V IPC first appeared in a development Unix variant called "Columbus Unix" and later was adopted by AT&T's System III. It is now found in most Unix systems, including Linux.
IPC data structures are created dynamically when a process requests an IPC resource (a semaphore, a message queue, or a shared memory region). An IPC resource is persistent: unless explicitly removed by a process, it is kept in memory and remains available until the system is shut down. An IPC resource may be used by every process, including those that do not share the ancestor that created the resource.
Because a process may require several IPC resources of the same type, each new resource is identified by a 32-bit IPC key, which is similar to the file pathname in the system's directory tree. Each IPC resource also has a 32-bit IPC identifier, which is somewhat similar to the file descriptor associated with an open file. IPC identifiers are assigned to IPC resources by the kernel and are unique within the system, while IPC keys can be freely chosen by programmers.
When two or more processes wish to communicate through an IPC resource, they all refer to the IPC identifier of the resource.
IPC resources are created by invoking the semget( ), msgget(
), or shmget( )
functions, depending on whether the new resource is a semaphore, a
message queue, or a shared memory region.
The main objective of each of these three functions is to derive from the IPC key (passed as the first parameter) the corresponding IPC identifier, which is then used by the process for accessing the resource. If there is no IPC resource already associated with the IPC key, a new resource is created. If everything goes right, the function returns a positive IPC identifier; otherwise, it returns one of the error codes listed in Table 19-7.
Table 19-7. Error codes returned while requesting an IPC identifier

| Error code | Description |
|---|---|
| EACCES | Process does not have proper access rights |
| EEXIST | Process tried to create an IPC resource with the same key as one that already exists |
| EINVAL | Invalid argument in a parameter of the function |
| ENOENT | No IPC resource with the requested key exists and the process did not ask to create it |
| ENOMEM | No more storage is left for an additional IPC resource |
| ENOSPC | Maximum limit on the number of IPC resources has been exceeded |
Assume that two independent processes want to share a common IPC resource. This can be achieved in two possible ways:
The processes agree on some fixed, predefined IPC key. This is the simplest case, and it works quite well for every complex application implemented by many processes. However, there's a chance that the same IPC key is chosen by another unrelated program. In this case, the IPC functions might be successfully invoked and still return the IPC identifier of the wrong resource.[*]
One process issues a semget(
), msgget( ), or
shmget( ) function by
specifying IPC_PRIVATE as its
IPC key. A new IPC resource is thus allocated, and the process can
either communicate its IPC identifier to the other process in the
application[†] or fork the other process itself. This method
ensures that the IPC resource cannot be used accidentally by other
applications.
The last parameter of the semget(
), msgget( ), and
shmget( ) functions can include
three flags. IPC_CREAT specifies
that the IPC resource must be created, if it does not already exist;
IPC_EXCL specifies that the
function must fail if the resource already exists and the IPC_CREAT flag is set; IPC_NOWAIT specifies that the process should
never block when accessing the IPC resource (typically, when fetching
a message or when acquiring a semaphore).
Even if the process uses the IPC_CREAT and IPC_EXCL flags, there is no way to ensure
exclusive access to an IPC resource, because other processes may
always refer to the resource by using its IPC identifier.
To minimize the risk of incorrectly referencing the wrong resource, the kernel does not recycle IPC identifiers as soon as they become free. Instead, the IPC identifier assigned to a resource is almost always larger than the identifier assigned to the previously allocated resource of the same type. (The only exception occurs when the 32-bit IPC identifier overflows.) Each IPC identifier is computed by combining a slot usage sequence number relative to the resource type, an arbitrary slot index for the allocated resource, and an arbitrary value chosen in the kernel that is greater than the maximum number of allocatable resources. If we choose s to represent the slot usage sequence number, M to represent the upper bound on the number of allocatable resources, and i to represent the slot index, where 0≤i<M, each IPC resource's ID is computed as follows:
IPC identifier = s × M + i
In Linux 2.6, the value of M is set to
32,768 (IPCMNI macro). The slot
usage sequence number s is initialized to 0 and
is increased by 1 at every resource allocation. When
s reaches a predefined threshold, which depends
on the type of IPC resource, it restarts from 0.
Every type of IPC resource (semaphores, message queues, and
shared memory areas) owns an ipc_ids data structure, which includes the
fields shown in Table
19-8.
Table 19-8. The fields of the ipc_ids data structure

| Type | Field | Description |
|---|---|---|
| int | in_use | Number of allocated IPC resources |
| int | max_id | Maximum slot index in use |
| unsigned short | seq | Slot usage sequence number for the next allocation |
| unsigned short | seq_max | Maximum slot usage sequence number |
| struct semaphore | sem | Semaphore protecting the ipc_ids data structure |
| struct ipc_id_ary | nullentry | Fake data structure pointed to by the entries field when no IPC resource has been allocated |
| struct ipc_id_ary * | entries | Pointer to the ipc_id_ary data structure |
The ipc_id_ary data structure
consists of two fields: p and
size. The p field is an array of pointers to kern_ipc_perm data structures, one for every
allocatable resource. The size
field is the size of this array. Initially, the array stores 1, 16, or
128 pointers, respectively for shared memory regions, message queues,
and semaphores. The kernel dynamically increases the size of the array
when it becomes too small. However, there is an upper bound on the
number of resources for each given type. The system administrator may
change these bounds by writing into the /proc/sys/kernel/sem, /proc/sys/kernel/msgmni, and /proc/sys/kernel/shmmni files,
respectively.
Each kern_ipc_perm data
structure is associated with an IPC resource and contains the fields
shown in Table
19-9. The uid, gid, cuid, and cgid fields store the user and group
identifiers of the resource's creator and the user and group
identifiers of the current resource's owner, respectively. The
mode bit mask includes six flags,
which store the read and write access permissions for the resource's
owner, the resource's group, and all other users. IPC access
permissions are similar to file access permissions described in the
section "Access Rights and
File Mode" in Chapter
1, except that the Execute permission flag is not used.
Table 19-9. The fields in the kern_ipc_perm structure

| Type | Field | Description |
|---|---|---|
| spinlock_t | lock | Spin lock protecting the IPC resource descriptor |
| int | deleted | Flag set if the resource has been released |
| key_t | key | IPC key |
| uid_t | uid | Owner user ID |
| gid_t | gid | Owner group ID |
| uid_t | cuid | Creator user ID |
| gid_t | cgid | Creator group ID |
| mode_t | mode | Permission bit mask |
| unsigned long | seq | Slot usage sequence number |
| void * | security | Pointer to a security structure (used by SELinux) |
The kern_ipc_perm data
structure also includes a key field
(which contains the IPC key of the corresponding resource) and a
seq field (which stores the slot
usage sequence number s used to compute the IPC
identifier of the resource).
The semctl( ), msgctl( ), and shmctl( ) functions may be used to handle
IPC resources. The IPC_SET
subcommand allows a process to change the owner's user and group
identifiers and the permission bit mask in the ipc_perm data structure. The IPC_STAT and IPC_INFO subcommands retrieve some
information concerning a resource. Finally, the IPC_RMID subcommand releases an IPC
resource. Depending on the type of IPC resource, other specialized
subcommands are also available.[*]
Once an IPC resource is created, a process may act on the
resource by means of a few specialized functions. A process may
acquire or release an IPC semaphore by issuing the semop( ) function. When a process wants to
send or receive an IPC message, it uses the msgsnd( ) and msgrcv( ) functions, respectively. Finally,
a process attaches and detaches an IPC shared memory region in its
address space by means of the shmat(
) and shmdt( ) functions,
respectively.
All IPC functions must be implemented through suitable
Linux system calls. Actually, in the 80 × 86 architecture, there is
just one IPC system call named ipc(
). When a process invokes an IPC function, let's say
msgget( ), it really invokes a
wrapper function in the C library. This in turn invokes the ipc( ) system call by passing to it all the
parameters of msgget( ) plus a
proper subcommand code—in this case, MSGGET. The sys_ipc( ) service routine examines the
subcommand code and invokes the kernel function that implements the
requested service.
The ipc( ) "multiplexer"
system call is a legacy from older Linux versions, which included the
IPC code in a dynamic module (see Appendix B). It did not make much
sense to reserve several system call entries in the system_call table for a kernel component
that could be missing, so the kernel designers adopted the multiplexer
approach.
Nowadays, System V IPC can no longer be compiled as a dynamic module, and there is no justification for using a single IPC system call. As a matter of fact, Linux provides one system call for each IPC function on Hewlett-Packard's Alpha architecture and on Intel's IA-64.
IPC semaphores are quite similar to the kernel semaphores introduced in Chapter 5; they are counters used to provide controlled access to shared data structures for multiple processes.
The semaphore value is positive if the protected resource is available, and 0 if the protected resource is currently not available. A process that wants to access the resource tries to decrease the semaphore value; the kernel, however, blocks the process until the operation on the semaphore yields a positive value. When a process relinquishes a protected resource, it increases its semaphore value; in doing so, any other process waiting for the semaphore is woken up.
Actually, IPC semaphores are more complicated to handle than kernel semaphores for two main reasons:
Each IPC semaphore is a set of one or more semaphore values,
not just a single value like a kernel semaphore. This means that
the same IPC resource can protect several independent shared data
structures. The number of semaphore values in each IPC semaphore
must be specified as a parameter of the semget( ) function when the resource is
being allocated. From now on, we'll refer to the counters inside
an IPC semaphore as primitive semaphores
. There are bounds both on the number of IPC
semaphore resources (by default, 128) and on the number of
primitive semaphores inside a single IPC semaphore resource (by
default, 250); however, the system administrator can easily modify
these bounds by writing into the /proc/sys/kernel/sem file.
System V IPC semaphores provide a fail-safe mechanism for situations in which a process dies without being able to undo the operations that it previously issued on a semaphore. When a process chooses to use this mechanism, the resulting operations are called undoable semaphore operations. When the process dies, all of its IPC semaphores can revert to the values they would have had if the process had never started its operations. This can help prevent other processes that use the same semaphores from remaining blocked indefinitely as a consequence of the terminating process failing to manually undo its semaphore operations.
First, we'll briefly sketch the typical steps performed by a process wishing to access one or more resources protected by an IPC semaphore:
Invokes the semget( )
wrapper function to get the IPC semaphore identifier, specifying
as the parameter the IPC key of the IPC semaphore that protects
the shared resources. If the process wants to create a new IPC
semaphore, it also specifies the IPC_CREAT flag or the IPC_PRIVATE key and the number of
primitive semaphores required (see the section "Using an IPC
Resource" earlier in this chapter).
Invokes the semop( )
wrapper function to test and decrease all primitive semaphore
values involved. If all the tests succeed, the decrements are
performed, the function terminates, and the process is allowed to
access the protected resources. If some semaphores are in use, the
process is usually suspended until some other process releases the
resources. The function receives as its parameters the IPC
semaphore identifier, an array of integers specifying the
operations to be atomically performed on the primitive semaphores,
and the number of such operations. Optionally, the process may
specify the SEM_UNDO flag,
which instructs the kernel to reverse the operations, should the
process exit without releasing the primitive semaphores.
When relinquishing the protected resources, it invokes the
semop( ) function again to
atomically increase all primitive semaphores involved.
Optionally, it invokes the semctl(
) wrapper function, specifying the IPC_RMID command to remove the IPC
semaphore from the system.
Now we can discuss how the kernel implements IPC semaphores. The
data structures involved are shown in Figure 19-1. The sem_ids variable stores the ipc_ids data structure of the IPC semaphore
resource type; the corresponding ipc_id_ary data structure contains an array
of pointers to sem_array data
structures, one item for every IPC semaphore resource.
Formally, the array stores pointers to kern_ipc_perm data structures, but each
structure is simply the first field of the sem_array data structure. All fields of the
sem_array data structure are shown
in Table
19-10.
Table 19-10. The fields in the sem_array data structure

| Type | Field | Description |
|---|---|---|
| struct kern_ipc_perm | sem_perm | kern_ipc_perm data structure |
| time_t | sem_otime | Timestamp of last semop( ) |
| time_t | sem_ctime | Timestamp of last change |
| struct sem * | sem_base | Pointer to first sem structure |
| struct sem_queue * | sem_pending | Pending operations |
| struct sem_queue ** | sem_pending_last | Last pending operation |
| struct sem_undo * | undo | Undo requests |
| unsigned long | sem_nsems | Number of semaphores in array |
The sem_base field points to
an array of sem data structures,
one for every IPC primitive semaphore. The latter data structure
includes only two fields:
semval
The value of the semaphore's counter.
sempid
The PID of the last process that accessed the semaphore. This value can be queried by a process through the semctl( ) wrapper function.
If a process aborts suddenly, it cannot undo the
operations that it started (for instance, release the semaphores it
reserved); so by declaring them undoable, the process lets the
kernel return the semaphores to a consistent state and allow other
processes to proceed. Processes can request undoable operations by
specifying the SEM_UNDO flag in
the semop( ) function.
Information to help the kernel reverse the undoable operations
performed by a given process on a given IPC semaphore resource is
stored in a sem_undo data
structure. It essentially contains the IPC identifier of the
semaphore and an array of integers representing the changes to the
primitive semaphore's values caused by all undoable operations
performed by the process.
A simple example can illustrate how such sem_undo elements are used. Consider a
process that uses an IPC semaphore resource containing four
primitive semaphores. Suppose that it invokes the semop( ) function to increase the first
counter by 1 and decrease the second by 2. If it specifies the
SEM_UNDO flag, the integer in the
first array element in the sem_undo data structure is decreased by 1,
the integer in the second element is increased by 2, and the other
two integers are left unchanged. Further undoable operations on the
IPC semaphore performed by the same process change the integers
stored in the sem_undo structure
accordingly. When the process exits, any nonzero value in that array
corresponds to one or more unbalanced operations on the
corresponding primitive semaphore; the kernel reverses these
operations, simply adding the nonzero value to the corresponding
semaphore's counter. In other words, the changes made by the aborted
process are backed out while the changes made by other processes are
still reflected in the state of the semaphores.
For each process, the kernel keeps track of all semaphore
resources handled with undoable operations so that it can roll them
back if the process unexpectedly exits. Furthermore, for each
semaphore, the kernel has to keep track of all its sem_undo structures so it can quickly
access them whenever a process uses semctl(
) to force an explicit value into a primitive semaphore's
counter or to destroy an IPC semaphore resource.
The kernel is able to handle these tasks efficiently, thanks to two lists, which we denote as the per-process and the per-semaphore lists. The first list keeps track of all semaphores operated upon by a given process with undoable operations. The second list keeps track of all processes that are acting on a given semaphore with undoable operations. More precisely:
The per-process list includes all sem_undo data structures corresponding
to IPC semaphores on which the process has performed undoable
operations. The sysvsem.undo_list field of the process
descriptor points to a data structure, of type sem_undo_list, which in turn contains
a pointer to the first element of the list; the proc_next field of each sem_undo data structure points to the
next element in the list. (As mentioned in the section "The clone( ), fork( ), and
vfork( ) System Calls" in Chapter 3, clone processes
created by passing the CLONE_SYSVSEM flag to the clone( ) system call share the same list of undoable
semaphore operations, because they share the same sem_undo_list descriptor.)
The per-semaphore list includes all sem_undo data structures corresponding
to the processes that performed undoable operations on the
semaphore. The undo field of
the sem_array data structure
points to the first element of the list, while the id_next field of each sem_undo data structure points to the
next element in the list.
The per-process list is used when a process terminates. The
exit_sem( ) function, which is
invoked by do_exit( ), walks
through the list and reverses the effect of any unbalanced operation
for every IPC semaphore touched by the process. By contrast, the
per-semaphore list is mainly used when a process invokes the
semctl( ) function to force an
explicit value into a primitive semaphore. The kernel sets the
corresponding element to 0 in the arrays of all sem_undo data structures referring to that
IPC semaphore resource, because it would no longer make any sense to
reverse the effect of previous undoable operations performed on that
primitive semaphore. Moreover, the per-semaphore list is also used
when an IPC semaphore is destroyed; all related sem_undo data structures are invalidated
by setting the semid field to
-1.[*]
The kernel associates a queue of pending
requests with each IPC semaphore to identify processes
that are waiting on one (or more) of the semaphores in the array.
The queue is a doubly linked list of sem_queue data structures whose fields are
shown in Table
19-11. The first and last pending requests in the queue are
referenced, respectively, by the sem_pending and sem_pending_last fields of the sem_array structure. This last field
allows the list to be handled as easily as a FIFO; new pending
requests are added to the end of the list so they will be serviced
later. The most important fields of a pending request are nsops (which stores the number of
primitive semaphores involved in the pending operation) and sops (which points to an array of integer
values describing each semaphore operation). The sleeper field stores the descriptor
address of the sleeping process that requested the operation.
Table 19-11. The fields in the sem_queue data structure

| Type | Field | Description |
|---|---|---|
| struct sem_queue * | next | Pointer to next queue element |
| struct sem_queue ** | prev | Pointer to previous queue element |
| struct task_struct * | sleeper | Pointer to the sleeping process that requested the semaphore operation |
| struct sem_undo * | undo | Pointer to sem_undo structure |
| int | pid | Process identifier |
| int | status | Completion status of operation |
| struct sem_array * | sma | Pointer to IPC semaphore descriptor |
| int | id | Slot index of the IPC semaphore resource |
| struct sembuf * | sops | Pointer to array of pending operations |
| int | nsops | Number of pending operations |
| int | alter | Flag denoting whether the operation modifies the semaphore array |
Figure 19-1
illustrates an IPC semaphore that has three pending requests. The
second and third requests refer to undoable operations, so the
undo field of the sem_queue data structure points to the
corresponding sem_undo structure;
the first pending request has a NULL
undo field because the corresponding operation is not
undoable.
Processes can communicate with one another by means of IPC messages . Each message generated by a process is sent to an IPC message queue, where it stays until another process reads it.
A message is composed of a fixed-size header and a variable-length text; it can be labeled with an integer value (the message type), which allows a process to selectively retrieve messages from its message queue.[*] Once a process has read a message from an IPC message queue, the kernel destroys the message; therefore, only one process can receive a given message.
To send a message, a process invokes the msgsnd( ) function, passing the following as
parameters:
The IPC identifier of the destination message queue
The size of the message text
The address of a User Mode buffer that contains the message type immediately followed by the message text
To retrieve a message, a process invokes the msgrcv( ) function, passing to it:
The IPC identifier of the IPC message queue resource
The pointer to a User Mode buffer to which the message type and message text should be copied
The size of this buffer
A value t that specifies what message should be retrieved
If the value t is 0, the first message in the queue is returned. If t is positive, the first message in the queue with its type equal to t is returned. Finally, if t is negative, the function returns the first message whose message type is the lowest value less than or equal to the absolute value of t.
To avoid resource exhaustion, there are some limits on the number of IPC message queue resources allowed (by default, 16), on the size of each message (by default, 8,192 bytes), and on the maximum total size of the messages in a queue (by default, 16,384 bytes). As usual, however, the system administrator can tune these values by writing into the /proc/sys/kernel/msgmni, /proc/sys/kernel/msgmax, and /proc/sys/kernel/msgmnb files, respectively.
The data structures associated with IPC message queues are shown in Figure 19-2. The msg_ids variable stores the ipc_ids data structure of the IPC message
queue resource type; the corresponding ipc_id_ary data structure contains an array
of pointers to msg_queue data
structures—one item for every IPC message queue resource. Formally,
the array stores pointers to kern_ipc_perm data structures, but each such
structure is simply the first field of the msg_queue data structure. All fields of the
msg_queue data structure are shown
in Table
19-12.
Table 19-12. The msg_queue data structure

| Type | Field | Description |
|---|---|---|
| struct kern_ipc_perm | q_perm | kern_ipc_perm data structure |
| time_t | q_stime | Time of last msgsnd( ) |
| time_t | q_rtime | Time of last msgrcv( ) |
| time_t | q_ctime | Last change time |
| unsigned long | q_cbytes | Number of bytes in queue |
| unsigned long | q_qnum | Number of messages in queue |
| unsigned long | q_qbytes | Maximum number of bytes in queue |
| pid_t | q_lspid | PID of last msgsnd( ) |
| pid_t | q_lrpid | PID of last msgrcv( ) |
| struct list_head | q_messages | List of messages in queue |
| struct list_head | q_receivers | List of processes receiving messages |
| struct list_head | q_senders | List of processes sending messages |
The most important field is q_messages, which represents the head (i.e.,
the first dummy element) of a doubly linked circular list containing
all messages currently in the queue.
Each message is broken into one or more pages, which are
dynamically allocated. The beginning of the first page stores the
message header, which is a data structure of type msg_msg; its fields are listed in Table 19-13. The
m_list field stores the pointers to
the previous and next messages in the queue. The message text starts
right after the msg_msg descriptor;
if the message is longer than 4,072 bytes (the page size minus the
size of the msg_msg descriptor), it
continues on another page, whose address is stored in the next field of the msg_msg descriptor. The second page frame
starts with a descriptor of type msg_msgseg, which simply includes a next pointer storing the address of an
optional third page, and so on.
Table 19-13. The msg_msg data structure

| Type | Field | Description |
|---|---|---|
| struct list_head | m_list | Pointers for message list |
| long | m_type | Message type |
| int | m_ts | Message text size |
| struct msg_msgseg * | next | Next portion of the message |
| void * | security | Pointer to a security data structure (used by SELinux) |
When the message queue is full (either the maximum number of
messages or the maximum total size has been reached), processes that
try to enqueue new messages may be blocked. The q_senders field of the msg_queue data structure is the head of a
list that includes the pointers to the descriptors of all blocked
sending processes.
Even receiving processes may be blocked when the message queue
is empty (or the process specified a type of message not present in
the queue). The q_receivers field
of the msg_queue data structure is
the head of a list of msg_receiver
data structures, one for every blocked receiving process. Each of
these structures essentially includes a pointer to the process
descriptor, a pointer to the msg_msg structure of the message, and the
type of the requested message.
The most useful IPC mechanism is shared memory , which allows two or more processes to access some common data structures by placing them in an IPC shared memory region. Each process that wants to access the data structures included in an IPC shared memory region must add to its address space a new memory region (see the section "Memory Regions" in Chapter 9), which maps the page frames associated with the IPC shared memory region. Such page frames can then be easily handled by the kernel through demand paging (see the section "Demand Paging" in Chapter 9).
As with semaphores and message queues, the shmget( ) function is invoked to get the IPC
identifier of a shared memory region, optionally creating it if it
does not already exist.
The shmat( ) function is
invoked to "attach" an IPC shared memory region to a process. It
receives as its parameter the identifier of the IPC shared memory
resource and tries to add a shared memory region to the address space
of the calling process. The calling process can require a specific
starting linear address for the memory region, but the address is
usually unimportant, and each process accessing the shared memory
region can use a different address in its own address space. The
process's Page Tables are left unchanged by shmat( ). We describe later what the kernel
does when the process tries to access a page that belongs to the new
memory region.
The shmdt( ) function is
invoked to "detach" an IPC shared memory region specified by its IPC
identifier—that is, to remove the corresponding memory region from the
process's address space. Recall that an IPC shared memory resource is
persistent: even if no process is using it, the corresponding pages
cannot be discarded, although they can be swapped out.
As for the other types of IPC resources, in order to avoid overuse of memory by User Mode processes, there are some limits on the allowed number of IPC shared memory regions (by default, 4,096), on the size of each segment (by default, 32 megabytes), and on the maximum total size of all segments (by default, 8 gigabytes). As usual, however, the system administrator can tune these values by writing into the /proc/sys/kernel/shmmni, /proc/sys/kernel/shmmax, and /proc/sys/kernel/shmall files, respectively.
The data structures associated with IPC shared memory regions
are shown in Figure
19-3. The shm_ids variable
stores the ipc_ids data structure
of the IPC shared memory resource type; the corresponding ipc_id_ary data structure contains an array
of pointers to shmid_kernel data
structures, one item for every IPC shared memory resource. Formally,
the array stores pointers to kern_ipc_perm data structures, but each such
structure is simply the first field of the shmid_kernel data structure. All fields of the
shmid_kernel data structure are
shown in Table
19-14.
Table 19-14. The fields in the shmid_kernel data structure
| Type | Field | Description |
|---|---|---|
| | | |
| | | Special file of the segment |
| | | Slot index of the segment |
| | | Number of current attaches |
| | | Segment size in bytes |
| | | Last access time |
| | | Last detach time |
| | | Last change time |
| | | PID of creator |
| | | PID of last accessing process |
| struct user_struct * | mlock_user | Pointer to the descriptor of the user who locked the shared memory resource in RAM (see the section "The clone( ), fork( ), and vfork( ) System Calls" in Chapter 3) |
The most important field is shm_file, which stores the address of a file
object. This reflects the tight integration of IPC shared memory with
the VFS layer in Linux 2.6. In particular, each IPC shared memory
region is associated with a file belonging to the
shm special filesystem (see the section "Special Filesystems" in
Chapter 12).
Because the shm filesystem has no mount
point in the system directory tree, no user can open and access its
files by means of regular VFS system calls. However, when a process
"attaches" a segment, the kernel invokes do_mmap( ) and creates a new shared memory
mapping of the file in the address space of the process. Therefore,
files that belong to the shm special filesystem
have just one file object method, mmap, which is implemented by the shm_mmap( ) function.
As shown in Figure
19-3, a memory region that corresponds to an IPC shared memory
region is described by a vm_area_struct object (see the section
"Memory Mapping" in
Chapter 16); its vm_file field points back to the file object
of the file in the special filesystem, which in turn references a
dentry object and an inode object. The inode number, stored in the
i_ino field of the inode, is
actually the slot index of the IPC shared memory region, so the inode
object indirectly references the shmid_kernel descriptor.
As usual for every shared memory mapping, page frames are
included in the page cache through an address_space object, which is embedded in
the inode and referenced by the i_mapping field of the inode (you might also
refer to Figure
16-2); in case of page frames belonging to an IPC shared memory
region, the methods of the address_space object are stored in the
shmem_aops global variable.
The kernel has to be careful when swapping out pages included in shared memory regions, and the role of the swap cache is crucial (this topic was already discussed in the section "The Swap Cache" in Chapter 17).
Pages of an IPC shared memory region are swappable—and not syncable (see Table 17-1 in Chapter 17)—because they map a special inode that has no image on disk. Thus, in order to reclaim a page of an IPC shared memory region, the kernel must write it into a swap area. Because an IPC shared memory region is persistent—that is, its pages must be preserved even when the segment is not attached to any process—the kernel cannot simply discard these pages even when they are no longer used by any process.
Let us see how the PFRA performs the reclaiming of a page
frame used by an IPC shared memory region. Everything is done as
described in the section "Low On Memory
Reclaiming" in Chapter
17, until the page is considered by shrink_list( ). Because this function does
not include any special check for pages of IPC shared memory
regions, it ends up invoking the try_to_unmap( ) function to remove every
reference to the page frame from the User Mode address spaces; as
explained in the section "Reverse Mapping" in
Chapter 17, the
corresponding page table entries are simply cleared.
Next, the shrink_list( )
function checks the PG_dirty flag
of the page and invokes pageout(
)—page frames of IPC shared memory regions are marked
dirty when they are allocated, thus pageout( ) is always invoked. In turn, the
pageout( ) function invokes the
writepage method of the address_space object of the mapped
file.
The shmem_writepage( )
function, which implements the writepage method for IPC shared memory
regions' pages, essentially allocates a new page slot in a swap
area, and moves the page from the page cache to the swap cache (it's
just a matter of changing the owner address_space object of the page). The
function also stores the swapped-out page identifier in a shmem_inode_info structure that embodies
the IPC memory region's inode object, and it sets again the PG_dirty flag of the page. As shown in
Figure 17-5 in
Chapter 17, the shrink_list( ) function checks the
PG_dirty flag and breaks the
reclaiming procedure by leaving the page in the inactive
list.
Sooner or later, the page frame will be processed again by the
PFRA. Once again, the shrink_list(
) function will try to flush the page to disk by invoking
pageout( ). This time, however,
the page is included in the swap cache, thus it is "owned" by the
address_space object of the
swapping subsystem, swapper_space. The corresponding writepage method, swap_writepage( ), effectively starts the
write operation into the swap area (see the section "Swapping Out Pages" in
Chapter 17). Once
pageout( ) terminates, shrink_list( ) verifies that the page is
now clean, removes it from the swap cache, and releases it to the
buddy system.
The pages added to a process by shmat( ) are dummy pages; the function
adds a new memory region into a process's address space, but it
doesn't modify the process's Page Tables. Moreover, as we have seen,
pages of an IPC shared memory region can be swapped out. Therefore,
these pages are handled through the demand paging mechanism.
As we know, a Page Fault occurs when a process tries to access a location of
an IPC shared memory region whose underlying page frame has not been
assigned. The corresponding exception handler determines that the
faulty address is inside the process address space and that the
corresponding Page Table entry is null; therefore, it invokes the
do_no_page( ) function (see the
section "Demand
Paging" in Chapter
9). In turn, this function checks whether the nopage method for the memory region is
defined. That method is invoked, and the Page Table entry is set to
the address returned from it (see also the section "Demand Paging for Memory
Mapping" in Chapter
16).
Memory regions used for IPC shared memory always define the
nopage method. It is implemented
by the shmem_nopage( ) function,
which performs the following operations:
Walks the chain of pointers in the VFS objects and derives the address of the inode object of the IPC shared memory resource (see Figure 19-3).
Computes the logical page number inside the segment from
the vm_start field of the
memory region descriptor and the requested address.
Checks whether the page is already included in the page cache; if so, terminates by returning the address of its descriptor.
Checks whether the page is included in the swap cache and is up-to-date; if so, terminates by returning the address of its descriptor.
Checks whether the shmem_inode_info that embodies the
inode object stores a swapped-out page identifier for the
logical page number. If so, it performs a swap-in operation by
invoking read_swap_cache_async(
) (see the section "Swapping in Pages"
in Chapter 17), waits
until the data transfer completes, and terminates by returning
the address of the page descriptor.
Otherwise, the page is not stored in a swap area; therefore, the function allocates a new page from the buddy system, inserts it into the page cache, and returns its address.
The do_no_page( ) function
sets the entry that corresponds to the faulty address in the
process's Page Table so that it points to the page frame returned by
the method.
[*] The ftok( ) function
attempts to create a new key from a file pathname and an 8-bit
project identifier passed as its parameters. It does not
guarantee, however, a unique key number, because there is a
small chance that it will return the same IPC key to two
different applications using different pathnames and project
identifiers.
[†] This implies, of course, the existence of another communication channel between the processes not based on IPC.
[*] An IPC design flaw is that a User Mode process cannot atomically create and initialize an IPC semaphore, because these two operations are performed by two different IPC functions.
[*] Notice that they are just invalidated and not freed, because it would be too costly to remove the data structures from the per-process lists of all processes.
[*] As we'll see, the message queue is implemented by means of a linked list. Because messages can be retrieved in an order different from "first in, first out," the name "message queue" is not appropriate. However, new messages are always put at the end of the linked list.
The POSIX standard (IEEE Std 1003.1-2001) defines an IPC mechanism based on message queues, which is usually known as POSIX message queues . They are much like the System V IPC's message queues already examined in the section "IPC Messages" earlier in this chapter. However, POSIX message queues sport a number of advantages over the older queues:
A much simpler file-based interface to the applications
Native support for message priorities (the priority ultimately determines the position of the message in the queue)
Native support for asynchronous notification of message arrivals, either by means of signals or thread creation
Timeouts for blocking send and receive operations
POSIX message queues are handled by means of a set of library functions, which are shown in Table 19-15.
Table 19-15. Library functions for POSIX message queues
| Function names | Description |
|---|---|
| | Open (optionally creating) a POSIX message queue |
| | Close a POSIX message queue (without destroying it) |
| | Destroy a POSIX message queue |
| | Send a message to a POSIX message queue; the latter function defines a time limit for the operation |
| | Fetch a message from a POSIX message queue; the latter function defines a time limit for the operation |
| | Establish an asynchronous notification mechanism for the arrival of messages in an empty POSIX message queue |
| | Respectively get and set attributes of a POSIX message queue (essentially, whether the send and receive operations should be blocking or nonblocking) |
Let's see how an application typically makes use of these
functions. As a first step, the application invokes the mq_open( ) library function to open a POSIX
message queue. The first argument of the function is a string specifying
the name of the queue; it is similar to a filename, and indeed it must
start with a slash (/). The library
function accepts a subset of the flags of the open( ) system call: O_RDONLY,
O_WRONLY, O_RDWR, O_CREAT, O_EXCL, and O_NONBLOCK (for nonblocking send and receive
operations). Notice that the application may create a new POSIX message
queue by specifying the O_CREAT flag.
The mq_open( ) function returns a
descriptor for the queue—much like the file descriptor returned by the
open( ) system call.
Once a POSIX message queue has been opened, the application may
send and receive messages by using the library functions mq_send( ) and mq_receive( ), passing to them the queue
descriptor returned by mq_open( )
. The application may also make use of mq_timedsend( ) and mq_timedreceive( )
to specify the maximum time that the application will
spend waiting for the send or receive operation to complete.
Rather than blocking in mq_receive(
)—or continuously polling the message queue if the O_NONBLOCK flag was specified—the application
might also establish an asynchronous notification mechanism by executing
the mq_notify( ) library function. Essentially, the application may
require that when a message is inserted in an empty queue, either a
signal is sent to a selected process, or a new thread is created.
Finally, when the application has finished using the message
queue, it invokes the mq_close( )
library function, passing to it the queue descriptor.
Notice that this function does not destroy the queue, exactly as the
close( ) system call does not remove a file. To destroy a queue,
the application makes use of the mq_unlink(
) function.
The implementation of POSIX message queues in Linux 2.6 is simple
and straightforward. A special filesystem named
mqueue (see the section "Special Filesystems" in
Chapter 12) has been
introduced, which contains an inode for each existing queue. The kernel
offers a few system calls, which roughly correspond to the library
functions listed in Table
19-15 earlier: mq_open( ),
mq_unlink( ), mq_timedsend( ), mq_timedreceive( ), mq_notify( ), and mq_getsetattr( ) . These system calls act transparently on the files of
the mqueue filesystem, thus much of the job is done
by the VFS layer. For example, notice that the kernel does not offer a
mq_close( ) function: in fact, the
queue descriptor returned to the application is effectively a file
descriptor, therefore the mq_close( )
library function can simply execute the close(
) system call to do its job.
The mqueue special filesystem does not
necessarily have to be mounted over the system directory tree. However, if it is
mounted, a user can create a POSIX message queue by touching a file in
the root directory of the filesystem; she can also get information about
the queue by reading the corresponding file. Finally, an application can
use select( ) and poll( ) to be notified about changes in the
queue state.
Each queue is described by an mqueue_inode_info descriptor, which embodies
the inode object associated with the file in the
mqueue special filesystem. When a POSIX message
queue system call receives a queue descriptor as parameter, it invokes
the VFS's fget( ) function to derive
the address of the corresponding file object; next, the system call gets
the inode object of the file in the mqueue
filesystem, and finally the address of the mqueue_inode_info descriptor that contains the
inode object.
The pending messages in a queue are collected in a singly linked
list rooted at the mqueue_inode_info
descriptor; each message is represented by a descriptor of type msg_msg—exactly the same descriptor used for
the System V IPC's messages described in the section "IPC Messages" earlier in
this chapter.
The concept of a "process," described in Chapter 3, was used in Unix from the beginning to represent the behavior of groups of running programs that compete for system resources. This final chapter focuses on the relationship between program and process. We specifically describe how the kernel sets up the execution context for a process according to the contents of the program file. While it may not seem like a big problem to load a bunch of instructions into memory and point the CPU to them, the kernel has to deal with flexibility in several areas:
Linux is distinguished by its ability to run binaries that were compiled for other operating systems. In particular, Linux is able to run an executable created for a 32-bit machine on the 64-bit version of the same machine. For instance, an executable created on a Pentium can run on a 64-bit AMD Opteron.
Many executable files don't contain all the code required to run the program but expect the kernel to load in functions from a library at runtime.
This includes the command-line arguments and environment variables familiar to programmers.
A program is stored on disk as an executable file, which includes both the object code of the functions to be executed and the data on which these functions will act. Many functions of the program are service routines available to all programmers; their object code is included in special files called "libraries." Actually, the code of a library function may either be statically copied into the executable file (static libraries) or linked to the process at runtime (shared libraries, because their code can be shared by several independent processes).
When launching a program, the user may supply two kinds of
information that affect the way it is executed: command-line arguments and
environment variables. Command-line arguments are
typed in by the user following the executable filename at the shell
prompt. Environment variables, such as HOME and PATH, are inherited from the shell, but the
users may modify the values of such variables before they launch the
program.
In the section "Executable Files," we explain what a program execution context is. In the section "Executable Formats," we mention some of the executable formats supported by Linux and show how Linux can change its "personality" to execute programs compiled for other operating systems. Finally, in the section "The exec Functions," we describe the system call that allows a process to start executing a new program.
Chapter 1 defined a process as an "execution context." By this we mean the collection of information needed to carry on a specific computation; it includes the pages accessed, the open files, the hardware register contents, and so on. An executable file is a regular file that describes how to initialize a new execution context (i.e., how to start a new computation).
Suppose a user wants to list the files in the current directory;
he knows that this result can be simply achieved by typing the filename
of the /bin/ls [*] external command at the shell prompt. The command shell
forks a new process, which in turn invokes an execve( ) system call (see the section "The exec Functions" later in
this chapter), passing as one of its parameters a string that includes
the full pathname for the ls
executable file—/bin/ls, in this
case. The sys_execve( ) service
routine finds the corresponding file, checks the executable format, and
modifies the execution context of the current process according to the
information stored in it. As a result, when the system call terminates,
the process starts executing the code stored in the executable file,
which performs the directory listing.
When a process starts running a new program, its execution context
changes drastically because most of the resources obtained during the
process's previous computations are discarded. In the preceding example,
when the process starts executing /bin/ls, it replaces the shell's arguments
with new ones passed as parameters in the execve( ) system call and acquires a new shell
environment (see the later section "Command-Line Arguments and Shell
Environment"). All pages inherited from the parent (and shared
with the Copy On Write mechanism) are released so that the new
computation starts with a fresh User Mode address space; even the
privileges of the process could change (see the later section "Process Credentials and
Capabilities"). However, the process PID doesn't change, and the
new computation inherits from the previous one all open file
descriptors that were not closed automatically while executing the
execve( ) system call.[*]
Traditionally, Unix systems associate with each process some credentials, which bind the process to a specific user and a specific user group. Credentials are important on multiuser systems because they determine what each process can or cannot do, thus preserving both the integrity of each user's personal data and the stability of the system as a whole.
The use of credentials requires support both in the process data structure and in the resources being protected. One obvious resource is a file. Thus, in the Ext2 filesystem , each file is owned by a specific user and is bound to a group of users. The owner of a file may decide what kind of operations are allowed on that file, distinguishing among herself, the file's user group, and all other users. When a process tries to access a file, the VFS always checks whether the access is legal, according to the permissions established by the file owner and the process credentials .
The process's credentials are stored in several fields of the process descriptor, listed in Table 20-1. These fields contain identifiers of users and user groups in the system, which are usually compared with the corresponding identifiers stored in the inodes of the files being accessed.
Table 20-1. Traditional process credentials
| Name | Description |
|---|---|
| | User and group real identifiers |
| | User and group effective identifiers |
| | User and group effective identifiers for file access |
| | Supplementary group identifiers |
| | User and group saved identifiers |
A UID of 0 specifies the superuser (root), while a user group ID of 0 specifies the root group. If a process credential stores a value of 0, the kernel bypasses the permission checks and allows the privileged process to perform various actions, such as those referring to system administration or hardware manipulation, that are not possible to unprivileged processes.
When a process is created, it always inherits the credentials of
its parent. However, these credentials can be modified later, either
when the process starts executing a new program or when it issues
suitable system calls. Usually, the uid, euid, fsuid, and suid fields of a process contain the same
value. When the process executes a setuid
program—that is, an executable file whose
setuid flag is on—the euid and fsuid fields are set to the identifier of
the file's owner. Almost all checks involve one of these two fields:
fsuid is used for file-related
operations, while euid is used for
all other operations. Similar considerations apply to the gid, egid, fsgid, and sgid fields that refer to group
identifiers.
As an illustration of how the fsuid field is used, consider the typical
situation when a user wants to change his password. All passwords are
stored in a common file, but he cannot directly edit this file because
it is protected. Therefore, he invokes a system program named
/usr/bin/passwd, which has the
setuid flag set and whose owner is the superuser.
When the process forked by the shell executes such a program, its
euid and fsuid fields are set to 0—the UID of the
superuser. Now the process can access the file, because, when the
kernel performs the access control, it finds a 0 value in fsuid. Of course, the /usr/bin/passwd program does not allow the
user to do anything but change his own password.
Unix's long history teaches the lesson that setuid
programs —programs that have the setuid
flag set—are quite dangerous: malicious users could trigger some
programming errors (bugs) in the code to force
setuid programs to perform operations that were
never planned by the program's original designers. In the worst case,
the entire system's security can be compromised. To minimize such
risks, Linux, like all modern Unix systems, allows processes to
acquire setuid privileges only when necessary and
drop them when they are no longer needed. This feature may turn out to
be useful when implementing user applications with several protection
levels. The process descriptor includes an suid field, which stores the values of the
effective identifiers (euid and
fsuid) at the
setuid program startup. The process can change
the effective identifiers by means of the setuid( ), setresuid( ), setfsuid( ), and setreuid( ) system calls.[*]
Table 20-2
shows how these system calls affect the process's credentials. Be
warned that if the calling process does not already have superuser
privileges—that is, if its euid
field is not null—these system calls can be used only to set values
already included in the process's credential fields. For instance, an
average user process can store the value 500 into its fsuid field by invoking the setfsuid( ) system call, but only if one of
the other credential fields already holds the same value.
Table 20-2. Semantics of the system calls that set process credentials
| Field | setuid(e), euid=0 | setuid(e), euid≠0 | setresuid(u,e,s) | setreuid(u,e) | setfsuid(f) |
|---|---|---|---|---|---|
| uid | Set to e | Unchanged | Set to u | Set to u | Unchanged |
| euid | Set to e | Set to e | Set to e | Set to e | Unchanged |
| fsuid | Set to e | Set to e | Set to e | Set to e | Set to f |
| suid | Set to e | Unchanged | Set to s | Set to e | Unchanged |
To understand the sometimes complex relationships among the four
user ID fields, consider for a moment the effects of the setuid( ) system call. The actions are
different, depending on whether the calling process's euid field is set to 0 (that is, the process
has superuser privileges) or to a normal UID.
If the euid field is 0, the
system call sets all credential fields of the calling process
(uid, euid, fsuid, and suid) to the value of the parameter e. A superuser process can thus drop its
privileges and become a process owned by a normal user. This happens,
for instance, when a user logs in: the system forks a new process with
superuser privileges, but the process drops its privileges by invoking
the setuid( ) system call and then
starts executing the user's login shell program.
If the euid field is not 0,
the setuid( ) system call modifies
only the value stored in euid and
fsuid, leaving the other two fields
unchanged. This behavior of the system call is useful when
implementing a setuid program that scales up and
down the process's effective privileges stored in the euid and fsuid fields.
The POSIX.1e draft—now withdrawn—introduced another model of process credentials based on the notion of "capabilities." The Linux kernel supports POSIX capabilities, although most Linux distributions do not make use of them.
A capability is simply a flag that asserts whether the process is allowed to perform a specific operation or a specific class of operations. This model is different from the traditional "superuser versus normal user" model in which a process can either do everything or do nothing, depending on its effective UID. As illustrated in Table 20-3, several capabilities have been included in the Linux kernel.
Table 20-3. Linux capabilities
| Name | Description |
|---|---|
| CAP_AUDIT_WRITE | Allow to generate audit messages by writing in netlink sockets |
| CAP_AUDIT_CONTROL | Allow to control kernel auditing activities by means of netlink sockets |
| CAP_CHOWN | Ignore restrictions on file user and group ownership changes |
| CAP_DAC_OVERRIDE | Ignore file access permissions |
| CAP_DAC_READ_SEARCH | Ignore file/directory read and search permissions |
| CAP_FOWNER | Generally ignore permission checks on file ownership |
| CAP_FSETID | Ignore restrictions on setting the setuid and setgid flags for files |
| CAP_KILL | Skip permission checks on signal sending |
| CAP_LINUX_IMMUTABLE | Allow modification of append-only and immutable Ext2/Ext3 files |
| CAP_IPC_LOCK | Allow locking of pages and of shared memory segments |
| CAP_IPC_OWNER | Skip IPC ownership checks |
| CAP_LEASE | Allow taking of leases on files (see "Linux File Locking" in Chapter 12) |
| CAP_MKNOD | Allow privileged mknod( ) operations |
| CAP_NET_ADMIN | Allow general networking administration |
| CAP_NET_BIND_SERVICE | Allow binding to TCP/UDP sockets below 1,024 |
| CAP_NET_BROADCAST | Allow broadcasting and multicasting |
| CAP_NET_RAW | Allow use of RAW and PACKET sockets |
| CAP_SETGID | Ignore restrictions on group's process credentials manipulations |
| CAP_SETPCAP | Allow capability manipulations on other processes |
| CAP_SETUID | Ignore restrictions on user's process credentials manipulations |
| CAP_SYS_ADMIN | Allow general system administration |
| CAP_SYS_BOOT | Allow use of reboot( ) |
| CAP_SYS_CHROOT | Allow use of chroot( ) |
| CAP_SYS_MODULE | Allow inserting and removing of kernel modules |
| CAP_SYS_NICE | Skip permission checks of the nice( ) and setpriority( ) system calls |
| CAP_SYS_PACCT | Allow configuration of process accounting |
| CAP_SYS_PTRACE | Allow use of ptrace( ) on any process |
| CAP_SYS_RAWIO | Allow access to I/O ports through ioperm( ) and iopl( ) |
| CAP_SYS_RESOURCE | Allow resource limits to be increased |
| CAP_SYS_TIME | Allow manipulation of system clock and real-time clock |
| CAP_SYS_TTY_CONFIG | Allow to configure the terminal and to execute the vhangup( ) system call |
The main advantage of capabilities is that, at any time, each program needs a limited number of them. Consequently, even if a malicious user discovers a way to exploit a buggy program, she can illegally perform only a limited set of operations.
Assume, for instance, that a buggy program has only the
CAP_SYS_TIME capability. In this
case, the malicious user who discovers an exploitation of the bug
can succeed only in illegally changing the real-time clock and the
system clock. She won't be able to perform any other kind of
privileged operations.
Neither the VFS nor the Ext2 filesystem currently supports the capability model, so there is
no way to associate an executable file with the set of capabilities
that should be enforced when a process executes that file.
Nevertheless, a process can explicitly get and lower its
capabilities by using, respectively, the capget( ) and capset( ) system calls. For instance, it
is possible to modify the login
program to retain a subset of the capabilities and drop the
others.
The Linux kernel already takes capabilities into account.
Let's consider, for instance, the nice(
) system call, which allows users to change the static
priority of a process. In the traditional model, only the superuser
can raise a priority; the kernel should therefore check whether the
euid field in the descriptor of
the calling process is set to 0. However, the Linux kernel defines a
capability called CAP_SYS_NICE,
which corresponds exactly to this kind of operation. The kernel
checks the value of this flag by invoking the capable( ) function and passing the
CAP_SYS_NICE value to it.
This approach works, thanks to some "compatibility hacks" that
have been added to the kernel code: each time a process sets the
euid and fsuid fields to 0 (either by invoking one
of the system calls listed in Table 20-2 or by
executing a setuid program owned by the
superuser), the kernel sets all process capabilities so that all
checks will succeed. When the process resets the euid and fsuid fields to the real UID of the process owner, the kernel checks the keep_capabilities flag in the process
descriptor and drops all capabilities of the process if the flag is
set. A process can set and reset the keep_capabilities flag by means of the
Linux-specific prctl( )
system call.
In Linux 2.6, capabilities are tightly integrated with the Linux Security Modules framework (LSM). In short, the LSM framework allows developers to define several alternative models for kernel security.
Each security model is implemented by a set of security hooks. A security hook is a function that is invoked by the kernel when it is about to perform an important, security-related operation. The hook function determines whether the operation should be carried out or rejected.
The security hooks are stored in a table of type security_operations. The address of the
hook table for the security model currently in use is stored in the
security_ops variable. By
default, the kernel makes use of a minimal security model
implemented by the dummy_security_ops table; each hook in
this table essentially checks the corresponding capability, if any,
or unconditionally returns 0 (operation allowed).
For instance, the service routines of the stime( ) and settimeofday( ) functions invoke the
settime security hook before
changing the system date and time. The corresponding function
pointed to by the dummy_security_ops table limits itself to
checking whether the CAP_SYS_TIME
capability of the current process is set, and returns either 0 or
-EPERM accordingly.
Sophisticated security models for the Linux kernel have been devised. A widely known example is Security-Enhanced Linux (SELinux), developed by the United States' National Security Agency.
When a user types a command, the program that is loaded to satisfy the request may receive some command-line arguments from the shell. For example, when a user types the command:
$ ls -l /usr/bin
to get a full listing of the files in the /usr/bin directory, the shell process
creates a new process to execute the command. This new process loads
the /bin/ls executable file. In
doing so, most of the execution context inherited from the shell is
lost, but the three separate arguments ls, -l,
and /usr/bin are kept. Generally,
the new process may receive any number of arguments.
The conventions for passing the command-line arguments depend on
the high-level language used. In the C language, the main( ) function of a program may receive as
its parameters an integer specifying how many arguments have been
passed to the program and the address of an array of pointers to
strings. The following prototype formalizes this standard:
int main(int argc, char *argv[])
Going back to the previous example, when the /bin/ls program is invoked, argc has the value 3, argv[0] points to the ls string, argv[1] points to the -l string, and argv[2] points to the /usr/bin string. The end of the argv array is always marked by a null
pointer, so argv[3] contains
NULL.
A third optional parameter that may be passed in the C language
to the main( ) function is the
parameter containing environment variables
. They are used to customize the execution context of a
process, to provide general information to a user or other processes,
or to allow a process to keep some information across an execve( ) system call.
To use the environment variables, main(
) can be declared as follows:
int main(int argc, char *argv[], char *envp[])
The envp parameter points to
an array of pointers to environment strings of the form:
VAR_NAME=something
where VAR_NAME represents the
name of an environment variable, while the substring following the
= delimiter represents the actual
value assigned to the variable. The end of the envp array is marked by a null pointer, like
the argv array. The address of the
envp array is also stored in the
environ global variable of the C
library.
Command-line arguments and environment strings are placed on the User Mode stack, right before the return address (see the section "Parameter Passing" in Chapter 10). The bottom locations of the User Mode stack are illustrated in Figure 20-1. Notice that the environment variables are located near the bottom of the stack, right after a 0 long integer.
Each high-level source code file is transformed through several steps into an object file, which contains the machine code of the assembly language instructions corresponding to the high-level instructions. An object file cannot be executed, because it does not contain the linear address that corresponds to each reference to a name of a global symbol external to the source code file, such as functions in libraries or other source code files of the same program. The assigning, or resolution, of such addresses is performed by the linker, which collects all the object files of the program and constructs the executable file. The linker also analyzes the library's functions used by the program and glues them into the executable file in a manner described later in this chapter.
Most programs, even the most trivial ones, use libraries. Consider, for instance, the following one-line C program:
void main(void) { }
Although this program does not compute anything, a lot of work
is needed to set up the execution environment (see the section "The exec Functions" later
in this chapter) and to kill the process when the program terminates
(see the section "Destroying
Processes" in Chapter
3). In particular, when the main(
) function terminates, the C compiler inserts an exit_group( ) function call in the object
code.
We know from Chapter 10 that programs usually invoke system calls through wrapper routines in the C library. This holds for the C compiler, too. Besides including the code directly generated by compiling the program's statements, each executable file also includes some "glue" code to handle the interactions of the User Mode process with the kernel. Portions of such glue code are stored in the C library.
Many other libraries of functions, besides the C library, are included in Unix systems. A generic Linux system typically uses several hundreds of libraries. Just to mention a couple of them: the math library libm includes advanced functions for floating point operations, while the X11 library libX11 collects together the basic low-level functions for the X11 Window System graphics interface.
All executable files in traditional Unix systems were based on static libraries . This means that the executable file produced by the linker includes not only the code of the original program but also the code of the library functions that the program refers to. One big disadvantage of statically linked programs is that they eat lots of space on disk. Indeed, each statically linked executable file duplicates some portion of library code.
Modern Unix systems use shared libraries
. The executable file does not contain the library
object code, but only a reference to the library name. When the
program is loaded in memory for execution, a suitable program called
dynamic linker (also named ld.so
) takes care of analyzing the library names in the
executable file, locating the library in the system's directory tree
and making the requested code available to the executing process. A
process can also load additional shared libraries at runtime by using
the dlopen( ) library function.
Shared libraries are especially convenient on systems that provide file memory mapping, because they reduce the amount of main memory requested for executing a program. When the dynamic linker must link a shared library to a process, it does not copy the object code, but performs only a memory mapping of the relevant portion of the library file into the process's address space. This allows the page frames containing the machine code of the library to be shared among all processes that are using the same code. Clearly, sharing is not possible if the program has been linked statically.
Shared libraries also have some disadvantages. The startup time of a dynamically linked program is usually longer than that of a statically linked one. Moreover, dynamically linked programs are not as portable as statically linked ones, because they may not execute properly in systems that include a different version of the same library.
A user may always require a program to be linked statically. For
example, the GCC compiler offers the -static option, which tells the linker to
use the static libraries instead of the shared ones.
The linear address space of a Unix program is traditionally partitioned, from a logical point of view, into several linear address intervals called segments:[*]
Text segment
Includes the program's executable code.
Initialized data segment
Contains the initialized data—that is, the static variables and the global variables whose initial values are stored in the executable file (because the program must know their values at startup).
Uninitialized data segment
Contains the uninitialized data—that is, all global variables whose initial values are not stored in the executable file (because the program sets the values before referencing them); it is historically called a bss segment.
Stack segment
Contains the program stack, which includes the return addresses, parameters, and local variables of the functions being executed.
Each mm_struct memory
descriptor (see the section "The Memory Descriptor" in
Chapter 9) includes some
fields that identify the role of a few crucial memory
regions of the corresponding process:
start_code, end_code
Store the initial and final linear addresses of the memory region that includes the native code of the program—the code in the executable file.
start_data, end_data
Store the initial and final linear addresses of the memory region that includes the native initialized data of the program, as specified in the executable file. The fields identify a memory region that roughly corresponds to the data segment.
start_brk, brk
Store the initial and final linear addresses of the memory region that includes the dynamically allocated memory areas of the process (see the section "Managing the Heap" in Chapter 9). This memory region is sometimes called the heap.
start_stack
Stores the address right above that of main( )'s return address; as
illustrated in Figure
20-1, higher addresses are reserved (recall that stacks
grow toward lower addresses).
arg_start, arg_end
Store the initial and final addresses of the stack portion containing the command-line arguments.
env_start, env_end
Store the initial and final addresses of the stack portion containing the environment strings.
Notice that shared libraries and file memory mapping have made the classification of the process's address space based on program segments obsolete, because each of the shared libraries is mapped into a different memory region from those discussed in the preceding list.
The flexible memory region layout has been introduced in the kernel version 2.6.9: essentially, each process gets a memory layout that depends on how much the User Mode stack is expected to grow. However, the old, classical layout can still be used (mainly when the kernel cannot put a limit on the size of the User Mode stack of a process). Both layouts are described in Table 20-4, assuming the 80 × 86 architecture with the default User Mode address space spanning up to 3 GB.
Table 20-4. The memory region layouts in the 80 × 86 architecture
| Type of memory region | Classical layout | Flexible layout |
|---|---|---|
| Text segment (ELF) | Starts from 0x08048000 | Same |
| Data and bss segments | Starts right after the text segment | Same |
| Heap | Starts right after the data and bss segments | Same |
| File memory mappings and anonymous memory regions | Starts from 0x40000000; newer regions are added at successively higher addresses | Starts near the end (lowest address) of the User Mode stack; libraries added at successively lower addresses |
| User Mode stack | Starts at 0xc0000000 and grows towards lower addresses | Same |
As you can see, the layouts differ only on the position of the
memory regions for file memory mappings and anonymous mappings. In
the classical layout, these regions are placed starting at one-third
of the whole User Mode address space, usually at 0x40000000; newer regions are added at
higher linear addresses, thus the regions expand towards the User
Mode stack.
Conversely, in the flexible layout the memory regions for file memory mapping and anonymous mappings are placed near the end of the User Mode stack; newer regions are added at lower linear addresses, thus the regions expand towards the heap. Remember that the stack grows towards lower addresses, too.
The kernel typically uses the flexible layout when it can get
a limit on the size of the User Mode stack by means of the RLIMIT_STACK resource limit (see the
section "Process
Resource Limits" in Chapter 3). This limit
determines the size of the linear address space reserved for the
stack; however, this size cannot be smaller than 128 MB or larger
than 2.5 GB.
On the other hand, if either the RLIMIT_STACK resource limit is set to
"infinity" or the system administrator has set to 1 the sysctl_legacy_va_layout variable (by
writing in the /proc/sys/vm/legacy_va_layout file or by
issuing the proper sysctl( )
system call), the kernel cannot determine an upper
bound on the size of the User Mode stack, thus it sticks to the
classical memory region layout.
Why has the flexible layout been introduced? Its main advantage is that it allows a process to make better use of the User Mode linear address space. In the classical layout the heap is limited to less than 1 GB, while the other memory regions can fill up to about 2 GB (minus the stack size). In the flexible layout, these constraints are gone: both the heap and the other memory regions can freely expand until all the linear addresses left unused by the User Mode stack and the program's fixed-size segments are taken.
At this point, a small, practical experiment can be quite enlightening. Let's write and compile the following C program:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main( )
{
char cmd[32];
brk((void *)0x8051000);
sprintf(cmd, "cat /proc/self/maps");
system(cmd);
return 0;
}
Essentially, the program enlarges the heap of the process (see the section "Managing the Heap" in Chapter 9), then it reads the maps file in the /proc special filesystem that produces the list of memory regions of the process itself.
Let's run the program without putting any limit on the stack size:
# ulimit -s unlimited; /tmp/memorylayout
08048000-08049000 r-xp 00000000 03:03 5042408 /tmp/memorylayout
08049000-0804a000 rwxp 00000000 03:03 5042408 /tmp/memorylayout
0804a000-08051000 rwxp 0804a000 00:00 0
40000000-40014000 r-xp 00000000 03:03 620801 /lib/ld-2.3.2.so
40014000-40015000 rwxp 00013000 03:03 620801 /lib/ld-2.3.2.so
40015000-40016000 rwxp 40015000 00:00 0
4002f000-40157000 r-xp 00000000 03:03 620804 /lib/libc-2.3.2.so
40157000-4015b000 rwxp 00128000 03:03 620804 /lib/libc-2.3.2.so
4015b000-4015e000 rwxp 4015b000 00:00 0
bffeb000-c0000000 rwxp bffeb000 00:00 0
ffffe000-fffff000 ---p 00000000 00:00 0
(You might see a slightly different table, depending on the version of the C compiler suite and on how the program has been linked.) The first two hexadecimal numbers represent the extent of the memory region; they are followed by the permission flags; finally, there is some information about the file mapped by the memory region, if any: the starting offset inside the file, the block device number and the inode number, and the filename.
Notice that all regions listed are implemented by means of
private memory mappings (the letter p in the permission column). This is not
surprising because these memory regions exist only to provide data
to a process. While executing instructions, a process may modify the
contents of these memory regions; however, the files on disk
associated with them stay unchanged. This is precisely how private
memory mappings act.
The memory region starting from 0x8048000 is a memory mapping associated
with the portion of the /tmp/memorylayout file ranging from byte
0 to byte 4,095. The permissions specify that the region is
executable (it contains object code), read-only (it's not writable
because the instructions don't change during a run), and private.
That's correct, because the region maps the text segment of the
program.
The memory region starting from 0x8049000 is another memory mapping
associated with the same portion of /tmp/memorylayout ranging from byte 0 to
byte 4,095. This program is so small that the text, data, and bss
segments of the program are included in the same file's page. Thus,
the memory region containing the data and bss segments overlaps with
the previous memory region in the linear address space.
The third memory region contains the heap of the process.
Notice that it terminates at the linear address 0x8051000 that was passed to the brk( ) system call.
The next two memory regions starting from 0x40000000 and 0x40014000 correspond to the text segment
and to the data and bss segments, respectively, of the dynamic
linker for the ELF shared libraries—/lib/ld-2.3.2.so on this system. The
dynamic linker is never executed alone: it is always memory-mapped
inside the address space of a process executing another program. The
anonymous memory region starting from 0x40015000 has been allocated by the
dynamic linker.
On this system, the C library happens to be stored in the
/lib/libc-2.3.2.so file. The
text segment and the data and bss segments of the C library are
mapped into the next two memory regions, starting from address
0x4002f000. Remember that page
frames included in private regions can be shared among several
processes with the Copy On Write mechanism, as long as they are not
modified. Thus, because the text segment is read-only, the page
frames containing the executable code of the C library are shared
among almost all currently executing processes (all except the
statically linked ones). The anonymous memory region starting from
0x4015b000 has been allocated by
the C library.
The anonymous memory region from 0xbffeb000 to 0xc0000000 is associated with the User
Mode stack. We already explained in the section "Page Fault Exception
Handler" in Chapter
9 how the stack is automatically expanded toward lower
addresses whenever necessary.
Finally, the one-page anonymous memory region from 0xffffe000 contains the vsyscall page of
the process, which is accessed when issuing a system call and
returning from a signal handler (see the section "Issuing a System Call via the
sysenter Instruction" in Chapter 10 and the section
"Catching the
Signal" in Chapter
11).
Now let's run the same program by enforcing a limit on the size of the User Mode stack:
# ulimit -s 100; /tmp/memorylayout
08048000-08049000 r-xp 00000000 03:03 5042408 /tmp/memorylayout
08049000-0804a000 rwxp 00000000 03:03 5042408 /tmp/memorylayout
0804a000-08051000 rwxp 0804a000 00:00 0
b7ea3000-b7fcb000 r-xp 00000000 03:03 620804 /lib/libc-2.3.2.so
b7fcb000-b7fcf000 rwxp 00128000 03:03 620804 /lib/libc-2.3.2.so
b7fcf000-b7fd2000 rwxp b7fcf000 00:00 0
b7feb000-b7fec000 rwxp b7feb000 00:00 0
b7fec000-b8000000 r-xp 00000000 03:03 620801 /lib/ld-2.3.2.so
b8000000-b8001000 rwxp 00013000 03:03 620801 /lib/ld-2.3.2.so
bffeb000-c0000000 rwxp bffeb000 00:00 0
ffffe000-fffff000 ---p 00000000 00:00 0
Notice how the layout has changed: the dynamic linker has been mapped about 128 MB above the highest stack address. Furthermore, because the memory regions of the C library have been created later, they get lower linear addresses.
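Each line of this output follows a fixed layout, so it can be decoded mechanically. Here is a minimal sketch (the helper name and buffer sizes are our own) that extracts the region boundaries, the permission flags, and the mapped pathname with `sscanf()`:

```c
#include <stdio.h>
#include <string.h>

/* Minimal parser for one line of /proc/<pid>/maps output, as shown
 * above. Fills the start/end linear addresses, the permission string,
 * and the mapped pathname ("" for anonymous regions).
 * Returns 0 on success, -1 when the line cannot be parsed. */
int parse_maps_line(const char *line, unsigned long *start,
                    unsigned long *end, char perms[5], char path[256])
{
    unsigned long offset, inode;
    unsigned int dev_major, dev_minor;

    path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
                   start, end, perms, &offset,
                   &dev_major, &dev_minor, &inode, path);
    return n >= 7 ? 0 : -1;   /* the pathname field is optional */
}
```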
Execution tracing is a technique that allows a program to monitor the execution of another program. The traced program can be executed step by step, until a signal is received, or until a system call is invoked. Execution tracing is widely used by debuggers, together with other techniques such as the insertion of breakpoints in the debugged program and runtime access to its variables. We focus on how the kernel supports execution tracing rather than discussing how debuggers work.
In Linux, execution tracing is performed through the ptrace( ) system call, which can handle the
commands listed in Table
20-5. Processes having the CAP_SYS_PTRACE capability flag set are
allowed to trace every process in the system except
init. Conversely, a process
P with no CAP_SYS_PTRACE capability is allowed to
trace only processes having the same owner as P.
Moreover, a process cannot be traced by two processes at the same
time.
Table 20-5. The ptrace commands in the 80 × 86 architecture
Command | Description |
|---|---|
| PTRACE_ATTACH | Start execution tracing for another process |
| PTRACE_CONT | Resume execution |
| PTRACE_DETACH | Terminate execution tracing |
| PTRACE_GET_THREAD_AREA | Get the Thread Local Storage (TLS) area on behalf of the traced process |
| PTRACE_GETEVENTMSG | Get additional data from the traced process (e.g., the PID of a newly forked process) |
| PTRACE_GETFPREGS | Read floating point registers |
| PTRACE_GETFPXREGS | Read MMX and XMM registers |
| PTRACE_GETREGS | Read privileged CPU's registers |
| PTRACE_GETSIGINFO | Get information on the last signal delivered to the traced process |
| PTRACE_KILL | Kill the traced process |
| PTRACE_OLDSETOPTIONS | Architecture-dependent command equivalent to PTRACE_SETOPTIONS |
| PTRACE_PEEKDATA | Read a 32-bit value from the data segment |
| PTRACE_PEEKTEXT | Read a 32-bit value from the text segment |
| PTRACE_PEEKUSR | Read the CPU's normal and debug registers |
| PTRACE_POKEDATA | Write a 32-bit value into the data segment |
| PTRACE_POKETEXT | Write a 32-bit value into the text segment |
| PTRACE_POKEUSR | Write the CPU's normal and debug registers |
| PTRACE_SET_THREAD_AREA | Set the Thread Local Storage (TLS) area on behalf of the traced process |
| PTRACE_SETFPREGS | Write floating point registers |
| PTRACE_SETFPXREGS | Write MMX and XMM registers |
| PTRACE_SETOPTIONS | Modify the tracing options |
| PTRACE_SETREGS | Write privileged CPU's registers |
| PTRACE_SETSIGINFO | Forge the information on the last signal delivered to the traced process |
| PTRACE_SINGLESTEP | Resume execution for a single assembly language instruction |
| PTRACE_SYSCALL | Resume execution until the next system call boundary |
| PTRACE_TRACEME | Start execution tracing for the current process |
The ptrace( ) system call
modifies the parent field in the
descriptor of the traced process so that it points to the tracing
process; therefore, the tracing process becomes the effective parent
of the traced one. When execution tracing terminates—i.e., when
ptrace( ) is invoked with the
PTRACE_DETACH command—the system
call sets the parent field back to the value of
real_parent, thus restoring the
original parent of the traced process (see the section "Relationships Among
Processes" in Chapter
3).
Several monitored events can be associated with a traced program:
End of execution of a single assembly language instruction
Entering a system call
Exiting from a system call
Receiving a signal
When a monitored event occurs, the traced program is stopped and
a SIGCHLD signal is sent to its
parent. When the parent wishes to resume the child's execution, it can
use one of the PTRACE_CONT,
PTRACE_SINGLESTEP, and PTRACE_SYSCALL commands, depending on the
kind of event it wants to monitor.
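This stop-and-resume protocol can be sketched from User Mode with the glibc `ptrace()` wrapper. The helper name and the use of a self-delivered `SIGSTOP` as the monitored event are our own choices for illustration, not kernel code:

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minimal tracing session: the child asks to be traced and stops
 * itself; the parent observes the stop via waitpid(), resumes the
 * child with PTRACE_CONT, and collects its exit status.
 * Returns the child's exit code, or -1 on error. */
int trace_child_once(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* let the parent trace us */
        raise(SIGSTOP);                        /* monitored event: stop    */
        _exit(42);                             /* runs after PTRACE_CONT   */
    }

    int status;
    waitpid(child, &status, 0);                /* child is now stopped     */
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP)
        return -1;
    ptrace(PTRACE_CONT, child, NULL, NULL);    /* resume until next event  */
    waitpid(child, &status, 0);                /* child has exited         */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```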
The PTRACE_CONT command
simply resumes execution; the child executes until it receives another
signal. This kind of tracing is implemented by means of the PT_PTRACED flag in the ptrace field of the process descriptor,
which is checked by the do_signal(
) function (see the section "Delivering a Signal" in
Chapter 11).
The PTRACE_SINGLESTEP command
forces the child process to execute the next assembly language
instruction, and then stops it again. This kind of tracing is
implemented on 80 × 86-based machines by means of the TF trap flag in the eflags register: when it is on, a "Debug" exception is raised right after every assembly
language instruction. The corresponding exception handler just clears
the flag, forces the current process to stop, and sends a SIGCHLD signal to its parent. Notice that
setting the TF flag is not a
privileged operation, so User Mode processes can force single-step
execution even without the ptrace(
) system call. The kernel checks the PT_DTRACE flag in the process descriptor to
keep track of whether the child process is being single-stepped
through ptrace( ).
The PTRACE_SYSCALL command
causes the traced process to resume execution until a system call is
invoked. The process is stopped twice: the first time when the system
call starts and the second time when the system call terminates. This
kind of tracing is implemented by means of the TIF_SYSCALL_TRACE flag included in the
flags field of the thread_info structure of the process, which
is checked in the system_call( )
assembly language function (see the section "Issuing a System Call via the
int $0x80 Instruction" in Chapter 10).
A process can also be traced using some debugging features of
the Intel Pentium processors. For example, the parent could set the
values of the dr0,..., dr7 debug registers for the child by using
the PTRACE_POKEUSR command. When an
event monitored by a debug register occurs, the CPU raises the "Debug"
exception; the exception handler can then suspend the traced process
and send the SIGCHLD signal to the
parent.
[*] The pathnames of executable files are not fixed in Linux; they depend on the distribution used. Several standard naming schemes, such as Filesystem Hierarchy Standard (FHS), have been proposed for all Unix systems.
[*] By default, a file already opened by a process stays open
after issuing an execve( ) system
call. However, the file is automatically closed if the process has
set the corresponding bit in the close_on_exec field of the files_struct structure (see Table 12-7 in Chapter 12); this is done by
means of the fcntl( ) system
call.
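The close_on_exec bit mentioned in the footnote is manipulated from User Mode through `fcntl()`; a minimal sketch (the helper name is our own):

```c
#include <fcntl.h>
#include <unistd.h>

/* Mark a file descriptor close-on-exec with fcntl(), as the footnote
 * describes, and report whether the flag is set. Returns 1 when the
 * descriptor will be closed automatically across execve(). */
int set_close_on_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);        /* read current fd flags  */
    if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0)
        return 0;
    return (fcntl(fd, F_GETFD) & FD_CLOEXEC) != 0;
}
```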
[*] A group's effective credentials can be changed by issuing
the corresponding setgid( ),
setresgid( ), setfsgid( ), and setregid( ) system calls.
[*] The word "segment" has historical roots, because the first Unix systems implemented each linear address interval with a different segment register. Linux, however, does not rely on the segmentation mechanism of the 80 × 86 microprocessors to implement program segments.
The standard Linux executable format is named Executable and Linking Format ( ELF). It was developed by Unix System Laboratories and is now the most widely used format in the Unix world. Several well-known Unix operating systems, such as System V Release 4 and Sun's Solaris 2, have adopted ELF as their main executable format.
Older Linux versions supported another format named Assembler OUTput Format (a.out); actually, there were several versions of that format floating around the Unix world. It is seldom used now, because ELF is much more practical.
Linux supports many other different formats for executable files; in this way, it can run programs compiled for other operating systems, such as MS-DOS EXE programs or BSD Unix's COFF executables. A few executable formats, such as Java or bash scripts, are platform-independent.
An executable format is described by an object of type linux_binfmt, which essentially provides three
methods:
load_binary
Sets up a new execution environment for the current process by reading the information stored in an executable file.
load_shlib
Dynamically binds a shared library to an already running
process; it is activated by the uselib(
) system call.
core_dump
Stores the execution context of the current process in a
file named core. This file,
whose format depends on the type of executable of the program
being executed, is usually created when a process receives a
signal whose default action is "dump" (see the section "Actions Performed upon
Delivering a Signal" in Chapter 11).
All linux_binfmt objects are
included in a singly linked list, and the address of the first element
is stored in the formats variable.
Elements can be inserted and removed in the list by invoking the
register_binfmt( ) and unregister_binfmt( ) functions. The register_binfmt( ) function is executed during
system startup for each executable format compiled into the kernel. This
function is also executed when a module implementing a new executable
format is being loaded, while the unregister_binfmt( ) function is invoked when
the module is unloaded.
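The registration scheme can be modeled with an ordinary singly linked list. The toy sketch below mirrors `register_binfmt( )`/`unregister_binfmt( )` in miniature; all names here are simplified stand-ins for illustration, not the kernel's actual code:

```c
#include <stddef.h>

/* Toy model of the formats list: a singly linked list of format
 * descriptors with register/unregister operations, head insertion
 * as in the kernel. */
struct binfmt {
    struct binfmt *next;
    const char *name;
};

static struct binfmt *formats;           /* head of the list */

void register_fmt(struct binfmt *fmt)    /* insert at the list head */
{
    fmt->next = formats;
    formats = fmt;
}

int unregister_fmt(struct binfmt *fmt)   /* unlink; -1 if absent */
{
    for (struct binfmt **p = &formats; *p; p = &(*p)->next)
        if (*p == fmt) {
            *p = fmt->next;
            return 0;
        }
    return -1;
}

int count_formats(void)                  /* walk the list */
{
    int n = 0;
    for (struct binfmt *p = formats; p; p = p->next)
        n++;
    return n;
}
```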
The last element in the formats
list is always an object describing the executable format for
interpreted scripts . This format defines only the load_binary method. The corresponding load_script( ) function checks whether the
executable file starts with the #!
pair of characters. If so, it interprets the rest of the first line as
the pathname of another executable file and tries to execute it by
passing the name of the script file as a parameter.[*]
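The check performed by `load_script( )` can be sketched in a few lines. This is an illustrative reimplementation (the helper name is our own), not the kernel's code:

```c
#include <string.h>

/* If the buffer holding the first bytes of the file starts with "#!",
 * extract the interpreter pathname from the rest of the first line.
 * Returns 0 and fills interp on success, -1 when the file is not an
 * interpreted script or the pathname does not fit. */
int parse_shebang(const char *buf, char *interp, size_t size)
{
    if (buf[0] != '#' || buf[1] != '!')
        return -1;
    const char *p = buf + 2;
    while (*p == ' ' || *p == '\t')      /* skip leading blanks        */
        p++;
    size_t len = strcspn(p, " \t\n");    /* pathname ends at blank/EOL */
    if (len == 0 || len >= size)
        return -1;
    memcpy(interp, p, len);
    interp[len] = '\0';
    return 0;
}
```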
Linux allows users to register their own custom executable formats. Each such format may be recognized either by means of a magic number stored in the first 128 bytes of the file, or by a filename extension that identifies the file type. For example, MS-DOS extensions consist of three characters separated from the filename by a dot: the .exe extension identifies executable programs, while the .bat extension identifies shell scripts.
When the kernel determines that the executable file has a custom format, it starts the proper interpreter program. The interpreter program runs in User Mode, receives as its parameter the pathname of the executable file, and carries on the computation. As an example, an executable file containing a Java program is dealt with by a Java virtual machine such as /usr/lib/java/bin/java.
The mechanism is similar to the script's format, but it's more powerful because it doesn't impose any restrictions on the custom format. To register a new format, the user writes into the register file of the binfmt_misc special filesystem (usually mounted on /proc/sys/fs/binfmt_misc) a string with the following format:
:name:type:offset:string:mask:interpreter:flags
where each field has the following meaning:
name
An identifier for the new format
type
The type of recognition (M for magic number, E for extension)
offset
The starting offset of the magic number inside the file
string
The byte sequence to be matched either in the magic number or in the extension
mask
The string to mask out some bits in string
interpreter
The full pathname of the interpreter program
flags
Some optional flags that control how the interpreter program has to be invoked
For example, the following command performed by the superuser enables the kernel to recognize the Microsoft Windows executable format:
$ echo ':DOSWin:M:0:MZ:0xff:/usr/bin/wine:' > /proc/sys/fs/binfmt_misc/register
A Windows executable file has the MZ magic number in the first two bytes, and it is executed by the /usr/bin/wine interpreter program.
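The recognition rule for 'M'-type entries can be sketched as a masked byte comparison: the bytes at the registered offset, filtered through the mask, must equal the magic string. The function below is our own illustration of this idea, not kernel code; the DOSWin entry above amounts to an exact match on "MZ" with an all-ones mask:

```c
#include <stddef.h>

/* Compare len file bytes at the given offset against a magic string
 * under a per-byte mask: a zero mask bit means "don't care".
 * Returns 1 on a match, 0 otherwise. */
int magic_matches(const unsigned char *file, size_t offset,
                  const unsigned char *magic, const unsigned char *mask,
                  size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((file[offset + i] ^ magic[i]) & mask[i])  /* masked mismatch */
            return 0;
    return 1;
}
```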
[*] It is possible to execute a script file even if it doesn't
start with the #! characters, as
long as the file is written in the language recognized by a command
shell. In this case, however, the script is interpreted either by
the shell on which the user types the command or by the default
Bourne shell sh; therefore, the
kernel is not directly involved.
As mentioned in Chapter 1, a neat feature of Linux is its ability to execute files compiled for other operating systems. Of course, this is possible only if the files include machine code for the same computer architecture on which the kernel is running. Two kinds of support are offered for these "foreign" programs:
Emulated execution: necessary to execute programs that include system calls that are not POSIX-compliant
Native execution: valid for programs whose system calls are totally POSIX-compliant
Microsoft MS-DOS and Windows programs are emulated: they cannot be natively executed, because they include APIs that are not recognized by Linux. An emulator such as DOSemu or Wine (which appeared in the example at the end of the previous section) is invoked to translate each API call into an emulating wrapper function call, which in turn uses the existing Linux system calls. Because emulators are mostly implemented as User Mode applications, we don't discuss them further.
On the other hand, POSIX-compliant programs compiled on operating
systems other than Linux can be executed without too much trouble,
because POSIX operating systems offer similar APIs. (Actually, the APIs
should be identical, although this is not always the case.) Minor
differences that the kernel must iron out usually refer to how system
calls are invoked or how the various signals are numbered. This
information is stored in execution domain
descriptors of type exec_domain.
A process specifies its execution domain by setting the personality field of its descriptor and storing the address of the
corresponding exec_domain data
structure in the exec_domain field of
the thread_info structure. A process
can change its personality by issuing a suitable system call named
personality( ) ; typical values assumed by the system call's parameter
are listed in Table
20-6. Programmers are not expected to directly change the
personality of their programs; instead, the personality( ) system call should be issued by
the glue code that sets up the execution context of the process (see the
next section).
Table 20-6. Personalities supported by the Linux kernel
Personality | Operating system |
|---|---|
| PER_LINUX | Standard execution domain |
| PER_LINUX_32BIT | Linux with 32-bit physical addresses in 64-bit architectures |
| PER_LINUX_FDPIC | Linux program in ELF FDPIC format |
| PER_SVR4 | System V Release 4 |
| PER_SVR3 | System V Release 3 |
| PER_SCOSVR3 | SCO Unix System V Release 3 |
| PER_OSR5 | SCO OpenServer Release 5 |
| PER_WYSEV386 | Unix System V/386 Release 3.2.1 |
| PER_ISCR4 | Interactive Unix System V Release 4 |
| PER_BSD | BSD Unix |
| PER_SUNOS | SunOS |
| PER_XENIX | Xenix |
| PER_LINUX32 | Emulation of Linux 32-bit programs in 64-bit architectures (using a 4 GB User Mode address space) |
| PER_LINUX32_3GB | Emulation of Linux 32-bit programs in 64-bit architectures (using a 3 GB User Mode address space) |
| PER_IRIX32 | SGI IRIX-5 32 bit |
| PER_IRIXN32 | SGI IRIX-6 32 bit |
| PER_IRIX64 | SGI IRIX-6 64 bit |
| PER_RISCOS | Acorn RISC OS |
| PER_SOLARIS | Sun's Solaris |
| PER_UW7 | Caldera's UnixWare 7 |
| PER_OSF4 | OSF/1 v4 |
| PER_HPUX | Hewlett-Packard's HP-UX |
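A process can query its current execution domain without changing it: by convention, passing the all-ones value 0xffffffff to `personality( )` returns the current personality and leaves it untouched. A one-line sketch (the wrapper name is our own):

```c
#include <sys/personality.h>

/* Query the current execution domain without modifying it.
 * For a native Linux process the base personality is PER_LINUX. */
int current_personality(void)
{
    return personality(0xffffffff);   /* returns the previous value */
}
```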
Unix systems provide a family of functions that replace
the execution context of a process with a new context described by an
executable file. The names of these functions start with the prefix
exec, followed by one or two letters;
therefore, a generic function in the family is usually referred to as an
exec function.
The exec functions are listed
in Table 20-7; they
differ in how the parameters are interpreted.
The first parameter of each function denotes the pathname of the
file to be executed. The pathname can be absolute or relative to the
process's current directory. Moreover, if the name does not include any
/ characters, the execlp( ) and
execvp( ) functions search for the
executable file in all directories specified by the PATH environment variable.
Besides the first parameter, the execl(
), execlp( ), and execle( ) functions include a variable number
of additional parameters. Each points to a string describing a
command-line argument for the new program; as the "l" character in the function names suggests,
the parameters are organized in a list terminated by a NULL value. Usually, the first command-line
argument duplicates the executable filename. Conversely, the execv( ), execvp(
), and execve( ) functions specify the command-line arguments with a
single parameter; as the v character
in the function names suggests, the parameter is the address of a vector
of pointers to command-line argument strings. The last component of the
array must be NULL.
The execle( ) and execve( ) functions receive as their last
parameter the address of an array of pointers to environment strings; as
usual, the last component of the array must be NULL. The other functions may access the
environment for the new program from the external environ global variable, which is defined in
the C library.
All exec functions, with the
exception of execve( ), are wrapper
routines defined in the C library and use execve( ), which is the only system call
offered by Linux to deal with program execution.
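A consequence worth emphasizing is that `execve( )` replaces the calling process on success and therefore never returns; it comes back to the caller only when the new program cannot be loaded. The sketch below demonstrates the failure path; the pathname is deliberately nonexistent and purely illustrative:

```c
#include <errno.h>
#include <unistd.h>

/* Invoke execve() on a pathname that does not exist. On success
 * execve() would never return; here it fails, so the function
 * returns the resulting errno value (ENOENT). */
int exec_missing_file(void)
{
    char *argv[] = { "nosuchprogram", NULL };
    char *envp[] = { NULL };

    execve("/nonexistent/nosuchprogram", argv, envp);
    return errno;   /* reached only because execve() failed */
}
```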
The sys_execve( ) service
routine receives the following parameters:
The address of the executable file pathname (in the User Mode address space).
The address of a NULL-terminated array (in the User Mode
address space) of pointers to strings (again in the User Mode
address space); each string represents a command-line
argument.
The address of a NULL-terminated array (in the User Mode
address space) of pointers to strings (again in the User Mode
address space); each string represents an environment variable in
the NAME=value format.
The function copies the executable file pathname into a newly
allocated page frame. It then invokes the do_execve( ) function, passing to it the
pointers to the page frame, to the pointer's arrays, and to the location
of the Kernel Mode stack where the User Mode register contents are
saved. In turn, do_execve( ) performs
the following operations:
Dynamically allocates a linux_binprm data structure, which will be
filled with data concerning the new executable file.
Invokes path_lookup( ),
dentry_open( ), and path_release( ) to get the dentry object,
the file object, and the inode object associated with the executable
file. On failure, it returns the proper error code.
Verifies that the file is executable by the current process;
also, checks that the file is not being written by looking at the
i_writecount field of the inode;
stores -1 in that field to forbid
further write accesses.
In multiprocessor systems, it invokes the sched_exec( ) function to determine the
least loaded CPU that can execute the new program and to migrate the
current process to it (see Chapter 7).
Invokes init_new_context( )
to check whether the current process was using a custom Local
Descriptor Table (see the section "The Linux LDTs" in
Chapter 2); in this case,
the function allocates and fills a new LDT to be used by the new
program.
Invokes the prepare_binprm(
) function to fill the linux_binprm data structure. This
function, in turn, performs the following operations:
Checks again whether the file is executable (at least one
execute access right is set); if not, returns an error code.
(The previous check in step 3 is not sufficient because a
process with the CAP_DAC_OVERRIDE capability set always
satisfies the check; see the section "Process Credentials and
Capabilities" earlier in this chapter).
Initializes the e_uid
and e_gid fields of the
linux_binprm structure,
taking into account the values of the
setuid and setgid
flags of the executable file. These fields represent the
effective user and group IDs, respectively. Also checks process
capabilities (a compatibility hack explained in the earlier
section "Process
Credentials and Capabilities").
Fills the buf field of
the linux_binprm structure
with the first 128 bytes of the executable file. These bytes
include the magic number of the executable format and other
information suitable for recognizing the executable file.
Copies the file pathname, command-line arguments, and environment strings into one or more newly allocated page frames. (Eventually, they are assigned to the User Mode address space.)
Invokes the search_binary_handler(
) function, which scans the formats list and tries to apply the
load_binary method of each
element, passing to it the linux_binprm data structure. The scan of
the formats list terminates as
soon as a load_binary method
succeeds in acknowledging the executable format of the file.
If the executable file format is not present in the formats list, it releases all allocated
page frames and returns the error code -ENOEXEC. Linux cannot recognize the
executable file format.
Otherwise, the function releases the linux_binprm data structure and returns
the code obtained from the load_binary method associated with the
executable format of the file.
The load_binary method
corresponding to an executable file format performs the following
operations (we assume that the executable file is stored on a filesystem
that allows file memory mapping and that it requires one or more shared
libraries):
Checks some magic numbers stored in the first 128 bytes of the
file to identify the executable format. If the magic numbers don't
match, it returns the error code -ENOEXEC.
Reads the header of the executable file. This header describes the program's segments and the shared libraries requested.
Gets from the executable file the pathname of the dynamic linker, which is used to locate the shared libraries and map them into memory.
Gets the dentry object (as well as the inode object and the file object) of the dynamic linker.
Checks the execution permissions of the dynamic linker.
Copies the first 128 bytes of the dynamic linker into a buffer.
Performs some consistency checks on the dynamic linker type.
Invokes the flush_old_exec(
) function to release almost all resources used by the
previous computation; in turn, this function performs the following
operations:
If the table of signal handlers is shared with other
processes, it allocates a new table and decrements the usage
counter of the old one; moreover, it detaches the process from
the old thread group (see the section "Identifying a
Process" in Chapter
3). All of this is done by invoking the de_thread( ) function.
Invokes unshare_files(
) to make a copy of the files_struct structure containing the
open files of the process, if it is shared with other processes
(see the section "Files Associated with a
Process" in Chapter
12).
Invokes the exec_mmap(
) function to release the memory descriptor, all
memory regions , and all page frames assigned to the process and
to clean up the process's Page Tables.
Sets the comm field of
the process descriptor with the executable file pathname.
Invokes the flush_thread(
) function to clear the values of the floating point
registers and debug registers saved in the TSS segment.
Updates the table of signal handlers by resetting each
signal to its default action. This is done by invoking the
flush_signal_handlers( )
function.
Invokes the flush_old_files(
) function to close all open files having the
corresponding flag in the files->close_on_exec field of the
process descriptor set (see the section "Files Associated with a
Process" in Chapter
12).[*]
Now we have reached the point of no return: the function cannot restore the previous computation if something goes wrong.
Clears the PF_FORKNOEXEC
flag in the process descriptor. This flag, which is set when a
process is forked and cleared when it executes a new program, is
required for process accounting.
Sets up the new personality of the process—that is, the
personality field in the process
descriptor.
Invokes arch_pick_mmap_layout(
) to select the layout of the memory regions of the
process (see the section "Program Segments and Process
Memory Regions" earlier in this chapter).
Invokes the setup_arg_pages(
) function to allocate a new memory region descriptor for
the process's User Mode stack and to insert that memory region into
the process's address space. setup_arg_pages( ) also assigns the page
frames containing the command-line arguments and the environment
variable strings to the new memory region.
Invokes the do_mmap( )
function to create a new memory region that maps the text segment
(that is, the code) of the executable file. The initial linear
address of the memory region depends on the executable format,
because the program's executable code is usually not relocatable.
Therefore, the function assumes that the text segment is loaded
starting from some specific logical address offset (and thus from
some specified linear address). ELF programs are loaded starting
from linear address 0x08048000.
Invokes the do_mmap( )
function to create a new memory region that maps the data segment of
the executable file. Again, the initial linear address of the memory
region depends on the executable format, because the executable code
expects to find its variables at specified offsets (that is, at
specified linear addresses). In an ELF program, the data segment is
loaded right after the text segment.
Allocates additional memory regions for every other specialized segment of the executable file. Usually, there are none.
Invokes a function that loads the dynamic linker. If the
dynamic linker is an ELF executable, the function is named load_elf_interp( ). In general, the
function performs the operations in steps 12 through 14, but for the
dynamic linker instead of the file to be executed. The initial
addresses of the memory regions that will include the text and data
of the dynamic linker are specified by the dynamic linker itself;
however, they are very high (usually above 0x40000000) to avoid collisions with the
memory regions that map the text and data of the file to be executed
(see the earlier section "Program Segments and Process
Memory Regions").
Stores in the binfmt field
of the process descriptor the address of the linux_binfmt object of the executable
format.
Determines the new capabilities of the process.
Creates specific dynamic linker tables and stores them on the User Mode stack between the command-line arguments and the array of pointers to environment strings (see Figure 20-1).
Sets the values of the start_code, end_code, start_data, end_data, start_brk, brk, and start_stack fields of the process's memory
descriptor.
Invokes the do_brk( )
function to create a new anonymous memory region mapping the bss
segment of the program. (When the process writes into a variable, it
triggers demand paging , and thus the allocation of a page frame.) The size
of this memory region was computed when the executable program was
linked. The initial linear address of the memory region must be
specified, because the program's executable code is usually not
relocatable. In an ELF program, the bss segment is loaded right
after the data segment.
Invokes the start_thread( )
macro to modify the values of the User Mode registers eip and esp saved on the Kernel Mode stack, so
that they point to the entry point of the dynamic linker and to the
top of the new User Mode stack, respectively.
If the process is being traced, it notifies the debugger about
the completion of the execve( )
system call.
Returns the value 0 (success).
When the execve( ) system call
terminates and the calling process resumes its execution in User Mode,
the execution context is dramatically changed: the code that invoked the
system call no longer exists. In this sense, we could say that execve( ) never returns on success. Instead, a
new program to be executed is mapped in the address space of the
process.
However, the new program cannot yet be executed, because the dynamic linker must still take care of loading the shared libraries.[*]
Although the dynamic linker runs in User Mode, we briefly sketch
out here how it operates. Its first job is to set up a basic execution
context for itself, starting from the information stored by the kernel
in the User Mode stack between the array of pointers to environment
strings and arg_start. Then the
dynamic linker must examine the program to be executed to identify which
shared libraries must be loaded and which functions in each shared
library are effectively requested. Next, the interpreter issues several
mmap( ) system calls to create memory regions mapping the pages
that will hold the library functions (text and data) actually used by the program. Then the
interpreter updates all references to the symbols of the shared library,
according to the linear addresses of the library's memory regions.
Finally, the dynamic linker terminates its execution by jumping to the
main entry point of the program to be executed. From now on, the process
will execute the code of the executable file and of the shared
libraries.
As you may have noticed, executing a program is a complex activity that involves many facets of kernel design, such as process abstraction, memory management, system calls, and filesystems. It is the kind of topic that makes you realize what a marvelous piece of work Linux is!
[*] These flags can be read and modified by means of the
fcntl( ) system call.
[*] Things are much simpler if the executable file is statically
linked—that is, if no shared library is requested. The load_binary method simply maps the text,
data, bss, and stack segments of the program into the process memory
regions, and then sets the User Mode eip register to the entry point of the new
program.
This appendix explains what happens right after users switch on their computers—that is, how a Linux kernel image is copied into memory and executed. In short, we discuss how the kernel, and thus the whole system, is "bootstrapped."
Traditionally, the term bootstrap refers to a person who tries to stand up by pulling his own boots. In operating systems, the term denotes bringing at least a portion of the operating system into main memory and having the processor execute it. It also denotes the initialization of kernel data structures, the creation of some user processes, and the transfer of control to one of them.
Computer bootstrapping is a tedious, long task, because initially, nearly every hardware device, including the RAM, is in a random, unpredictable state. Moreover, the bootstrap process is highly dependent on the computer architecture; as usual in this book, we refer to the 80 × 86 architecture.
The moment after a computer is powered on, it is
practically useless because the RAM chips contain random data and no
operating system is running. To begin the boot, a special hardware
circuit raises the logical value of the RESET pin of the CPU. After
RESET is asserted, some registers of the processor (including cs and eip)
are set to fixed values, and the code found at physical address 0xfffffff0 is executed. This address is mapped
by the hardware to a certain read-only, persistent memory chip that is
often called Read-Only Memory (ROM). The set of programs stored in ROM
is traditionally called the Basic Input/Output
System (BIOS) in the 80 × 86
architecture, because it includes several interrupt-driven low-level
procedures used by all operating systems in the booting phase to handle
the hardware devices that make up the computer. Some operating systems,
such as Microsoft's MS-DOS, rely on BIOS to implement most system calls.
Once in protected mode (see the section "Segmentation in Hardware" in Chapter 2), Linux does not use BIOS any longer, but it provides its own device driver for every hardware device on the computer. In fact, the BIOS procedures must be executed in real mode, so they cannot share functions even if that would be beneficial.
The BIOS uses Real Mode addresses because they are the only ones available when the computer is turned on. A Real Mode address is composed of a seg segment and an off offset; the corresponding physical address is given by seg*16+off. As a result, no Global Descriptor Table, Local Descriptor Table, or paging table is needed by the CPU addressing circuit to translate a logical address into a physical one. Clearly, the code that initializes the GDT, LDT, and paging tables must run in Real Mode.
Linux is forced to use BIOS in the bootstrapping phase, when it must retrieve the kernel image from disk or from some other external device. The BIOS bootstrap procedure essentially performs the following four operations:
Executes a series of tests on the computer hardware to establish which devices are present and whether they are working properly. This phase is often called Power-On Self-Test (POST). During this phase, several messages, such as the BIOS version banner, are displayed.
Recent 80 × 86, AMD64, and Itanium computers make use of the Advanced Configuration and Power Interface (ACPI) standard. The bootstrap code in an ACPI-compliant BIOS builds several tables that describe the hardware devices present in the system. These tables have a vendor-independent format and can be read by the operating system kernel to learn how to handle the devices.
Initializes the hardware devices. This phase is crucial in modern PCI-based architectures, because it guarantees that all hardware devices operate without conflicts on the IRQ lines and I/O ports. At the end of this phase, a table of installed PCI devices is displayed.
Searches for an operating system to boot. Actually, depending on the BIOS setting, the procedure may try to access (in a predefined, customizable order) the first sector (boot sector) of every floppy disk, hard disk, and CD-ROM in the system.
As soon as a valid device is found, it copies the contents of
its first sector into RAM, starting from physical address 0x00007c00, and then jumps into that
address and executes the code just loaded.
The rest of this appendix takes you from the most primitive starting state to the full glory of a running Linux system.
The boot loader is the program invoked by the BIOS to load the image of an operating system kernel into RAM. Let's briefly sketch how boot loaders work in IBM's PC architecture.
To boot from a floppy disk, the instructions stored in its first sector are loaded in RAM and executed; these instructions copy all the remaining sectors containing the kernel image into RAM.
Booting from a hard disk is done differently. The first sector of the hard disk, named the Master Boot Record (MBR), includes the partition table[*] and a small program, which loads the first sector of the partition containing the operating system to be started. Some operating systems, such as Microsoft Windows 98, identify this partition by means of an active flag included in the partition table;[†] following this approach, only the operating system whose kernel image is stored in the active partition can be booted. As we will see later, Linux is more flexible because it replaces the rudimentary program included in the MBR with a sophisticated program—the "boot loader"—that allows users to select the operating system to be booted.
Kernel images of earlier Linux versions—up to the 2.4 series—included a minimal "boot loader" program in the first 512 bytes; thus, copying a kernel image starting from the first sector made the floppy bootable. On the other hand, kernel images of Linux 2.6 no longer include such a boot loader; thus, in order to boot from a floppy disk, a suitable boot loader has to be stored in the first disk sector. Nowadays, booting from a floppy is very similar to booting from a hard disk or from a CD-ROM.
A two-stage boot loader is required to boot a Linux kernel from disk. A well-known Linux boot loader on 80 × 86 systems is named LInux LOader (LILO). Other boot loaders for 80 × 86 systems do exist; for instance, the GRand Unified Bootloader (GRUB) is also widely used. GRUB is more advanced than LILO, because it recognizes several disk-based filesystems and is thus capable of reading portions of the boot program from files. Of course, specific boot loader programs exist for all architectures supported by Linux.
LILO may be installed either on the MBR (replacing the small program that loads the boot sector of the active partition) or in the boot sector of every disk partition. In both cases, the final result is the same: when the loader is executed at boot time, the user may choose which operating system to load.
Actually, the LILO boot loader is too large to fit into a single
sector, thus it is broken into two parts. The MBR or the partition
boot sector includes a small boot loader, which is loaded into RAM
starting from address 0x00007c00 by
the BIOS. This small program moves itself to the address 0x00096a00, sets up the Real Mode stack
(ranging from 0x00098000 to
0x000969ff), loads the second part
of the LILO boot loader into RAM starting from address 0x00096c00, and jumps into it.
In turn, this latter program reads a map of bootable operating systems from disk and offers the user a prompt so she can choose one of them. Finally, after the user has chosen the kernel to be loaded (or let a time-out elapse so that LILO chooses a default), the boot loader may either copy the boot sector of the corresponding partition into RAM and execute it or directly copy the kernel image into RAM.
Assuming that a Linux kernel image must be booted, the LILO boot loader, which relies on BIOS routines, performs essentially the following operations:
Invokes a BIOS procedure to display a "Loading" message.
Invokes a BIOS procedure to load an initial portion of the
kernel image from disk: the first 512 bytes of the kernel image
are put in RAM at address 0x00090000, while the code of the
setup( ) function (see below)
is put in RAM starting from address 0x00090200.
Invokes a BIOS procedure to load the rest of the kernel
image from disk and puts the image in RAM starting from either low
address 0x00010000 (for small
kernel images compiled with make
zImage) or high address 0x00100000 (for big kernel images
compiled with make bzImage). In
the following discussion, we say that the kernel image is "loaded
low" or "loaded high" in RAM, respectively. Support for big kernel
images uses essentially the same booting scheme as the other one,
but it places data in different physical memory addresses to avoid
problems with the ISA hole mentioned in the section "Physical Memory
Layout" in Chapter
2.
Jumps to the setup( )
code.
The code of the setup(
) assembly language function has been placed by the linker at
offset 0x200 of the kernel image
file. The boot loader can therefore easily locate the code and copy it
into RAM, starting from physical address 0x00090200.
The setup( ) function must
initialize the hardware devices in the computer and set up the
environment for the execution of the kernel program. Although the BIOS
already initialized most hardware devices, Linux does not rely on it,
but reinitializes the devices in its own manner to enhance portability
and robustness. setup( ) performs
essentially the following operations:
In ACPI -compliant systems, it invokes a BIOS routine that builds a table in RAM describing the layout of the system's physical memory (the table can be seen in the boot kernel messages by looking for the "BIOS-e820" label). In older systems, it invokes a BIOS routine that just returns the amount of RAM available in the system.
Sets the keyboard repeat delay and rate. (When the user keeps a key pressed past a certain amount of time, the keyboard device sends the corresponding keycode over and over to the CPU.)
Initializes the video adapter card.
Reinitializes the disk controller and determines the hard disk parameters.
Checks for an IBM Micro Channel bus (MCA).
Checks for a PS/2 pointing device (bus mouse).
If the BIOS supports the Enhanced Disk Drive Services (EDD), it invokes the proper BIOS procedure to build a table in RAM describing the hard disks available in the system. (The information included in the table can be seen by reading the files in the firmware/edd directory of the sysfs special filesystem.)
If the kernel image was loaded low in RAM (at physical address
0x00010000), the function moves
it to physical address 0x00001000. Conversely, if the kernel
image was loaded high in RAM, the function does not move it. This
step is necessary because to be able to store the kernel image on a
floppy disk and to reduce the booting time, the kernel image stored
on disk is compressed, and the decompression routine needs some free
space to use as a temporary buffer following the kernel image in
RAM.
Sets the A20 pin located on the 8042 keyboard controller. The A20 pin is a hack introduced in the 80286 -based systems to make physical addresses compatible with those of the ancient 8088 microprocessors. Unfortunately, the A20 pin must be properly set before switching to Protected Mode, otherwise the 21st bit of every physical address will always be regarded as zero by the CPU. Setting the A20 pin is a messy operation.
Sets up a provisional Interrupt Descriptor Table (IDT) and a provisional Global Descriptor Table (GDT).
Reprograms the Programmable Interrupt Controllers (PIC) to mask all interrupts, except IRQ2 which is the cascading interrupt between the two PICs.
Switches the CPU from Real Mode to Protected Mode by setting
the PE bit in the cr0 status register. The PG bit in the cr0 register is cleared, so paging is
still disabled.
Jumps to the startup_32( )
assembly language function.
There are two different startup_32( ) functions; the one we refer to
here is coded in the arch/i386/boot/compressed/head.S file. After setup( )
terminates, the function has been moved either to physical address
0x00100000 or to physical address
0x00001000, depending on whether the
kernel image was loaded high or low in RAM.
This function performs the following operations:
Initializes the segmentation registers and a provisional stack.
Fills the area of uninitialized data of the kernel identified
by the _edata and _end symbols with zeros (see the section
"Physical Memory
Layout" in Chapter
2).
Invokes the decompress_kernel(
) function to decompress the kernel image. The
"Uncompressing Linux..." message is displayed first. After the
kernel image is decompressed, the "Ok, booting the kernel." message
is shown. If the kernel image was loaded low, the decompressed
kernel is placed at physical address 0x00100000. Otherwise, if the kernel image
was loaded high, the decompressed kernel is placed in a temporary
buffer located after the compressed image. The decompressed image is
then moved into its final position, which starts at physical address
0x00100000.
Jumps to physical address 0x00100000.
The decompressed kernel image begins with another startup_32( ) function included in the
arch/i386/kernel/head.S file. Using the same name for both the functions does not
create any problems (besides confusing our readers), because both
functions are executed by jumping to their initial physical
addresses.
The second startup_32( )
function sets up the execution environment for the first Linux process
(process 0). The function performs the following operations:
Initializes the segmentation registers with their final values.
Fills the bss segment of the kernel (see the section "Program Segments and Process Memory Regions" in Chapter 20) with zeros.
Initializes the provisional kernel Page Tables contained in
swapper_pg_dir and pg0 to identically map the linear
addresses to the same physical addresses, as explained in the
section "Kernel Page
Tables" in Chapter
2.
Stores the address of the Page Global Directory in the
cr3 register, and enables paging by setting the PG bit in the cr0 register.
Sets up the Kernel Mode stack for process 0 (see the section "Kernel Threads" in Chapter 3).
Once again, the function clears all bits in the eflags register.
Invokes setup_idt( ) to
fill the IDT with null interrupt handlers (see the section "Preliminary Initialization of
the IDT" in Chapter
4).
Puts the system parameters obtained from the BIOS and the parameters passed to the operating system into the first page frame (see the section "Physical Memory Layout" in Chapter 2).
Identifies the model of the processor.
Loads the gdtr and idtr
registers with the addresses of the GDT and IDT
tables.
Jumps to the start_kernel(
) function.
The start_kernel( )
function completes the initialization of the Linux kernel. Nearly every
kernel component is initialized by this function; we mention just a few
of them:
The scheduler is initialized by invoking the sched_init( ) function (see Chapter 7).
The memory zones are initialized by invoking the build_all_zonelists( ) function (see the
section "Memory
Zones" in Chapter
8).
The Buddy system allocators are initialized by invoking the
page_alloc_init( ) and mem_init( ) functions (see the section
"The Buddy System
Algorithm" in Chapter
8).
The final initialization of the IDT is performed by invoking
trap_init( ) (see the section
"Exception
Handling" in Chapter
4) and init_IRQ( ) (see
the section "IRQ data
structures" in Chapter
4).
The TASKLET_SOFTIRQ and
HI_SOFTIRQ are initialized by
invoking the softirq_init( )
function (see the section "Softirqs" in Chapter 4).
The system date and time are initialized by the time_init( ) function (see the section
"The Linux Timekeeping
Architecture" in Chapter
6).
The slab allocator is initialized by the kmem_cache_init( ) function (see the
section "General and
Specific Caches" in Chapter 8).
The speed of the CPU clock is determined by invoking the
calibrate_delay( ) function (see
the section "Delay
Functions" in Chapter
6).
The kernel thread for process 1 is created by invoking the
kernel_thread( ) function. In
turn, this kernel thread creates the other kernel threads and executes the /sbin/init program, as described in the
section "Kernel
Threads" in Chapter
3.
Besides the "Linux version 2.6.11..." message, which is displayed
right after the beginning of start_kernel(
), many other messages are displayed in this last phase, both
by the init program and by the kernel threads. At
the end, the familiar login prompt appears on the console (or on the
graphical screen, if the X Window System is launched at startup), telling the user that the Linux
kernel is up and running.
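The initialization sequence just described can be visualized with a small User Mode sketch. The function names mirror the real kernel initializers mentioned above, but the stub bodies are hypothetical, and the ordering is only a rough rendering of the actual start_kernel( ) code:

```c
#include <assert.h>
#include <string.h>

/* Userspace sketch: stub initializers record the order in which
 * start_kernel( ) would invoke them. The names mirror the real
 * kernel functions; the bodies are hypothetical stubs. */
static const char *init_log[16];
static int init_count;

static void record(const char *name) { init_log[init_count++] = name; }

static void sched_init(void)          { record("sched_init"); }
static void build_all_zonelists(void) { record("build_all_zonelists"); }
static void page_alloc_init(void)     { record("page_alloc_init"); }
static void trap_init(void)           { record("trap_init"); }
static void init_IRQ(void)            { record("init_IRQ"); }
static void softirq_init(void)        { record("softirq_init"); }
static void time_init(void)           { record("time_init"); }
static void mem_init(void)            { record("mem_init"); }
static void kmem_cache_init(void)     { record("kmem_cache_init"); }
static void calibrate_delay(void)     { record("calibrate_delay"); }
static void rest_init(void)           { record("rest_init"); /* spawns process 1 */ }

/* Rough call order, following the 2.6 start_kernel( ) sequence. */
static void start_kernel_sketch(void)
{
    sched_init();
    build_all_zonelists();
    page_alloc_init();
    trap_init();
    init_IRQ();
    softirq_init();
    time_init();
    mem_init();
    kmem_cache_init();
    calibrate_delay();
    rest_init();
}
```

The point of the sketch is only the ordering: the scheduler and memory zones come early, while the kernel thread of process 1 is created last.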
When system programmers want to add new functionality to the Linux kernel, they are faced with a basic decision: should they write the new code so that it will be compiled as a module, or should they statically link the new code to the kernel?
As a general rule, system programmers tend to implement new code as a module. Because modules can be linked on demand (as we see later), the kernel does not have to be bloated with hundreds of seldom-used programs. Nearly every higher-level component of the Linux kernel—filesystems, device drivers, executable formats, network layers, and so on—can be compiled as a module. Linux distributions use modules extensively in order to support in a seamless way a wide range of hardware devices. For instance, the distribution puts tens of sound card driver modules in a proper directory, although only one of these modules will be effectively loaded on a specific machine.
However, some Linux code must necessarily be linked statically, which means that either the corresponding component is included in the kernel or it is not compiled at all. This happens typically when the component requires a modification to some data structure or function statically linked in the kernel.
For example, suppose the component has to introduce new fields
into the process descriptor. Linking a module cannot change an already
defined data structure such as task_struct because, even if the module uses
its modified version of the data structure, all statically linked code
continues to see the old version. Data corruption easily occurs. A
partial solution to the problem consists of "statically" adding the new
fields to the process descriptor, thus making them available to the
kernel component no matter how it has been linked. However, if the
kernel component is never used, such extra fields replicated in every
process descriptor are a waste of memory. If the new kernel component
increases the size of the process descriptor a lot, one would get better
system performance by adding the required fields in the data structure
only if the component is statically linked to the kernel.
As a second example, consider a kernel component that has to replace statically linked code. It's pretty clear that no such component can be compiled as a module, because the kernel cannot change the machine code already in RAM when linking the module. For instance, it is not possible to link a module that changes the way page frames are allocated, because the Buddy system functions are always statically linked to the kernel.[*]
The kernel has two key tasks to perform in managing modules. The first task is making sure the rest of the kernel can reach the module's global symbols, such as the entry point to its main function. A module must also know the addresses of symbols in the kernel and in other modules. Thus, references are resolved once and for all when a module is linked. The second task consists of keeping track of the use of modules, so that no module is unloaded while another module or another part of the kernel is using it. A simple reference count keeps track of each module's usage.
The license of the Linux kernel (GPL, version 2) is liberal in what users and industries can do with the source code; however, it strictly forbids the release of source code derived from—or heavily depending on—the Linux code under a non-GPL license. Essentially, the kernel developers want to be sure that their code—and all the code derived from it—will remain freely usable by all users.
Modules, however, pose a threat to this model. Someone might release a module for the Linux kernel in binary form only; for instance, a vendor might distribute the driver for its hardware device in a binary-only module. Nowadays, there are quite a few examples of these practices. Theoretically, the characteristics and behavior of the Linux kernel might be significantly changed by binary-only modules, thus effectively turning a Linux-derived kernel into a commercial product.
Thus, binary-only modules are not well received by the Linux
kernel developer community. The implementation of Linux modules
reflects this fact. Basically, each module developer should specify in
the module source code the type of license, by using the MODULE_LICENSE macro. If the license is not
GPL-compatible (or it is not specified at all), the module will not be
able to make use of many core functions and data structures of the
kernel. Moreover, using a module with a non-GPL license will "taint"
the kernel, which means that any supposed bug in the kernel will not
be taken into consideration by the kernel developers.
[*] You might wonder why your favorite kernel component has not been modularized. In most cases, there is no strong technical reason because it is essentially a software license issue. Kernel developers want to make sure that core components will never be replaced by proprietary code released through binary-only "black-box" modules.
Modules are stored in the filesystem as ELF object files and are linked to the kernel by executing the insmod program (see the later section, "Linking and Unlinking Modules"). For each module, the kernel allocates a memory area containing the following data:
A module object
A null-terminated string that represents the name of the module (all modules must have unique names)
The code that implements the functions of the module
The module object describes a
module; its fields are shown in Table B-1. A doubly linked
circular list collects all module
objects; the list head is stored in the modules variable, while the pointers to the
adjacent elements are stored in the list field of each module object.
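The list layout just described can be sketched in User Mode C. The structure below is a toy stand-in for the real module object (which has many more fields), and find_module( ) open-codes the kernel's container_of idiom:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal sketch of the kernel's list_head idiom: module objects
 * linked in a circular doubly linked list headed by `modules`.
 * This struct is a toy, not the real struct module. */
struct list_head { struct list_head *next, *prev; };

struct module {
    struct list_head list;   /* links into the modules list */
    const char *name;        /* unique module name */
};

static struct list_head modules = { &modules, &modules };

static void list_add(struct list_head *new, struct list_head *head)
{
    new->next = head->next;
    new->prev = head;
    head->next->prev = new;
    head->next = new;
}

/* Walk the list the way the kernel does when checking for a
 * duplicate name (container_of is open-coded for brevity). */
static struct module *find_module(const char *name)
{
    struct list_head *p;
    for (p = modules.next; p != &modules; p = p->next) {
        struct module *mod =
            (struct module *)((char *)p - offsetof(struct module, list));
        if (strcmp(mod->name, name) == 0)
            return mod;
    }
    return NULL;
}
```

Because the list is circular, the walk terminates when it comes back to the list head rather than on a NULL pointer.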
Table B-1. The module object
The state field encodes the
internal state of the module: it can be MODULE_STATE_LIVE (the module is active),
MODULE_STATE_COMING (the module is
being initialized), and MODULE_STATE_GOING (the module is being
removed).
As already mentioned in the section "Dynamic Address Checking: The
Fix-up Code" in Chapter
10, each module has its own exception table. The table includes
the addresses of the fixup code of the module, if any. The table is
copied into RAM when the module is linked, and its starting address is
stored in the extable field of the
module object.
Each module has a set of usage counters, one for each CPU,
stored in the ref field of the
corresponding module object. The
counter is increased when an operation involving the module's
functions is started and decreased when the operation terminates. A
module can be unlinked only if the sum of all usage counters is
0.
For example, suppose that the MS-DOS filesystem layer is compiled as a module and the module is linked at runtime. Initially, the module usage counters are set to 0. If the user mounts an MS-DOS floppy disk, one of the module usage counters is increased by 1. Conversely, when the user unmounts the floppy disk, one of the counters (possibly a different one from the counter that was increased) is decreased by 1. The total usage counter of the module is the sum of all CPU counters.
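The per-CPU counter scheme can be illustrated with a simplified User Mode sketch; the real kernel uses per-CPU data areas rather than a plain array, so take this only as an illustration of the bookkeeping:

```c
#include <assert.h>

#define NR_CPUS 4

/* Sketch: one usage counter per CPU (a simplification of the
 * module reference counters described above). */
struct module_ref_sketch { int count[NR_CPUS]; };

static void module_get(struct module_ref_sketch *ref, int cpu)
{
    ref->count[cpu]++;   /* e.g., a filesystem of the module is mounted */
}

static void module_put(struct module_ref_sketch *ref, int cpu)
{
    ref->count[cpu]--;   /* e.g., the filesystem is unmounted, possibly
                            on a different CPU than the mount */
}

/* A module can be unlinked only when the sum over all CPUs is zero;
 * individual per-CPU counters may well be negative. */
static int module_refcount(const struct module_ref_sketch *ref)
{
    int cpu, sum = 0;
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        sum += ref->count[cpu];
    return sum;
}
```

Note that after a mount on CPU 0 and an unmount handled on CPU 2, the individual counters are 1 and -1, yet the total is 0 and the module may be unlinked.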
When linking a module, all references to global kernel symbols (variables and functions) in the module's object code must be replaced with suitable addresses. This operation, which is very similar to that performed by the linker while compiling a User Mode program (see the section "Libraries" in Chapter 20), is delegated to the insmod external program (described later in the section "Linking and Unlinking Modules").
Some special kernel symbol tables
are used by the kernel to store the symbols that can be
accessed by modules together with their corresponding addresses. They
are contained in three sections of the kernel code segment: the
_ _kstrtab section includes the
names of the symbols, the _
_ksymtab section includes the addresses of the symbols that
can be used by all kinds of modules, and the _
_ksymtab_gpl section includes the addresses of the symbols
that can be used by the modules released under a GPL-compatible
license. The EXPORT_SYMBOL macro
and the EXPORT_SYMBOL_GPL macro,
when used inside the statically linked kernel code, force the C
compiler to add a specified symbol to the _
_ksymtab and _
_ksymtab_gpl sections, respectively.
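The effect of the two symbol sections can be sketched as a lookup restricted by license. The table entries below are hypothetical, and the real code walks the _ _ksymtab and _ _ksymtab_gpl sections rather than a single flat array:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of exported-symbol lookup: each entry pairs a name with an
 * address; GPL-only symbols are visible only to GPL-compatible
 * modules. Names and addresses here are made-up examples. */
struct kernel_symbol_sketch {
    const char *name;
    uintptr_t   value;
    int         gpl_only;   /* 1 if it lives in _ _ksymtab_gpl */
};

static const struct kernel_symbol_sketch symtab[] = {
    { "printk",          0xc0110000u, 0 },
    { "some_gpl_helper", 0xc0120000u, 1 },  /* hypothetical symbol */
};

/* Resolve a name; fail if the symbol is GPL-only and the requesting
 * module's license is not GPL-compatible (license_gplok == 0). */
static uintptr_t resolve_symbol(const char *name, int license_gplok)
{
    size_t i;
    for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
        if (strcmp(symtab[i].name, name) != 0)
            continue;
        if (symtab[i].gpl_only && !license_gplok)
            return 0;        /* symbol exists but is off limits */
        return symtab[i].value;
    }
    return 0;                /* unresolved */
}
```

A module with a non-GPL license thus fails to link as soon as it references a symbol exported with EXPORT_SYMBOL_GPL.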
Only the kernel symbols actually used by some existing module
are included in the table. Should a system programmer need, within
some module, to access a kernel symbol that is not already exported,
he can simply add the corresponding EXPORT_SYMBOL_GPL macro into the Linux
source code. Of course, he cannot legally export a new symbol for a
module whose license is not GPL-compatible.
Linked modules can also export their own symbols so that other
modules can access them. The module symbol tables
are contained in the _
_ksymtab, _ _ksymtab_gpl,
and _ _kstrtab sections of the
module code segment. To export a subset of symbols from the module,
the programmer can use the EXPORT_SYMBOL and EXPORT_SYMBOL_GPL macros described above.
The exported symbols of the module are copied into two memory arrays
when the module is linked, and their addresses are stored in the
syms and gpl_syms fields of the module object.
A module (B) can refer to the symbols exported by another module (A); in this case, we say that B is loaded on top of A, or equivalently that A is used by B. To link module B, module A must have already been linked; otherwise, the references to the symbols exported by A cannot be properly linked in B. In short, there is a dependency between modules.
The modules_which_use_me
field of the module object of A is
the head of a dependency list containing all modules that are used by
A; each element of the list is a small module_use descriptor containing the
pointers to the adjacent elements in the list and a pointer to the
corresponding module object; in our
example, a module_use descriptor
pointing to the B's module object
would appear in the modules_which_use_me list of A. The modules_which_use_me list must be updated
dynamically whenever a module is loaded on top of A. The module A
cannot be unloaded if its dependency list is not empty.
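The dependency tracking can be sketched as follows. For brevity the module_use elements are kept in a singly linked list, whereas the real descriptors sit in a doubly linked list; the structures are toys, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the dependency tracking described above: each
 * module_use element records one module loaded on top of another. */
struct mod_sketch;

struct module_use_sketch {
    struct module_use_sketch *next;
    struct mod_sketch *user;   /* the module using us */
};

struct mod_sketch {
    const char *name;
    struct module_use_sketch *modules_which_use_me;
};

/* B starts using A: record B in A's dependency list. */
static void use_module(struct mod_sketch *a, struct mod_sketch *b,
                       struct module_use_sketch *link)
{
    link->user = b;
    link->next = a->modules_which_use_me;
    a->modules_which_use_me = link;
}

/* Mirrors the unload check: a module with a non-empty dependency
 * list cannot be removed. */
static int can_unload(const struct mod_sketch *m)
{
    return m->modules_which_use_me == NULL;
}
```

With the msdos module loaded on top of fat (an example discussed later in this section), fat cannot be unloaded while msdos can.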
Beside A and B there could be, of course, another module (C) loaded on top of B, and so on. Stacking modules is an effective way to modularize the kernel source code, thus speeding up its development.
A user can link a module into the running kernel by executing the insmod external program. This program performs the following operations:
Reads from the command line the name of the module to be linked.
Locates the file containing the module's object code in the system directory tree. The file is usually placed in some subdirectory below /lib/modules.
Reads from disk the file containing the module's object code.
Invokes the init_module( )
system call, passing to it the address of the User Mode buffer
containing the module's object code, the length of the object code,
and the User Mode memory area containing the parameters of the
insmod program.
Terminates.
The sys_init_module( ) service
routine does all the real work; it performs the following main
operations:
Checks whether the user is allowed to link the module (the
current process must have the CAP_SYS_MODULE capability). In every
situation where one is adding functionality to a kernel, which has
access to all data and processes on the system, security is a
paramount concern.
Allocates a temporary memory area for the module's object code; then, copies into this memory area the data in the User Mode buffer passed as first parameter of the system call.
Checks that the data in the memory area effectively represents a module's ELF object; otherwise, returns an error code.
Allocates a memory area for the parameters passed to the insmod program, and fills it with the data in the User Mode buffer whose address was passed as third parameter of the system call.
Walks the modules list to
verify that the module is not already linked. The check is done by
comparing the names of the modules (name field in the module objects).
Allocates a memory area for the core executable code of the module, and fills it with the contents of the relevant sections of the module.
Allocates a memory area for the initialization code of the module, and fills it with the contents of the relevant sections of the module.
Determines the address of the module object for the new module. An image
of this object is included in the gnu.linkonce.this_module section of the
text segment of the module's ELF file. The module object is thus included in the
memory area filled in step 6.
Stores in the module_code
and module_init fields of the
module object the addresses of
the memory areas allocated in steps 6 and 7.
Initializes the modules_which_use_me list in the module object, and sets to zero all of the
module's reference counters except the counter of the executing CPU,
which is set to one.
Sets the license_gplok flag
in the module object according to
the type of license specified in the module object.
Using the kernel symbol tables and the module symbol tables, relocates the module's object code. This means replacing all occurrences of external and global symbols with the corresponding logical address offsets.
Initializes the syms and
gpl_syms fields of the module object so that they point to the
in-memory tables of symbols exported by the module.
The exception table of the module (see the section "The Exception Tables"
in Chapter 10) is
contained in the _ _ex_table
section of the module's ELF file; thus, it was copied into the memory
area allocated in step 6. This step stores its address in the extable field of the module object.
Parses the arguments of the insmod program, and sets the value of the corresponding module variables accordingly.
Registers the kobject included in the mkobj field of the module object so that a new sub-directory
for the module appears in the module directory of the sysfs special filesystem (see the section "Kobjects" in Chapter 13).
Frees the temporary memory area allocated in step 2.
Adds the module object in
the modules list.
Sets the state of the module to MODULE_STATE_COMING.
If defined, executes the init method of the module object.
Sets the state of the module to MODULE_STATE_LIVE.
Terminates by returning zero (success).
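The ELF validity check mentioned above can be sketched in a few lines of User Mode C. This covers only the magic-number test at the start of the image; the real kernel code also validates the section headers, the target architecture, and more:

```c
#include <string.h>

/* Sketch of the sanity check performed on a module image: it must
 * begin with the ELF magic number before the kernel goes any
 * further with relocation and linking. */
static int looks_like_elf(const unsigned char *image, unsigned long len)
{
    static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };
    return len >= 4 && memcmp(image, elf_magic, 4) == 0;
}
```

If the check fails, the service routine frees the temporary memory area and returns an error code to the insmod program.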
To unlink a module, a user invokes the rmmod external program, which performs the following operations:
Reads from the command line the name of the module to be unlinked.
Opens the /proc/modules file, which lists all modules linked into the kernel, and checks that the module to be removed is effectively linked.
Invokes the delete_module(
) system call, passing to it the name of the
module.
Terminates.
In turn, the sys_delete_module(
) service routine performs the following main
operations:
Checks whether the user is allowed to unlink the module (the
current process must have the CAP_SYS_MODULE capability).
Copies the module's name in a kernel buffer.
Walks the modules list to
find the module object of the
module.
Checks the modules_which_use_me dependency list of
the module; if it is not empty, the function returns an error
code.
Checks the state of the module; if it is not MODULE_STATE_LIVE, returns an error
code.
If the module has a custom init method, the function checks that it
also has a custom exit method; if
no exit method is defined, the
module should not be unloaded, and the function returns an error code.
To avoid race conditions, stops the activities of all CPUs in
the system, except the CPU executing the sys_delete_module( ) service
routine.
Sets the state of the module to MODULE_STATE_GOING.
If the sum of all reference counters of the module is greater than zero, returns an error code.
If defined, executes the exit method of the module.
Removes the module object
from the modules list, and
de-registers the module from the sysfs special filesystem.
Removes the module object
from the dependency lists of the modules that it was using.
Frees the memory areas that contain the module's executable
code, the module object, and the
various symbol and exception tables.
Returns zero (success).
A module can be automatically linked when the functionality it provides is requested and automatically removed afterward.
For instance, suppose that the MS-DOS filesystem has not been linked, either statically or
dynamically. If a user tries to mount an MS-DOS filesystem, the mount( ) system call normally fails by
returning an error code, because MS-DOS is not included in the file_systems list of registered filesystems.
However, if support for automatic linking of modules has been specified
when configuring the kernel, Linux makes an attempt to link the MS-DOS
module, and then scans the list of registered filesystems again. If the
module is successfully linked, the mount(
) system call can continue its execution as if the MS-DOS
filesystem were present from the beginning.
To automatically link a module, the kernel creates a kernel thread to execute the modprobe external program,[*] which takes care of possible complications due to module dependencies. The dependencies were discussed earlier: a module may require one or more other modules, and these in turn may require still other modules. For instance, the MS-DOS module requires another module named fat containing some code common to all filesystems based on a File Allocation Table (FAT). Thus, if it is not already present, the fat module must also be automatically linked into the running kernel when the MS-DOS module is requested. Resolving dependencies and finding modules is a type of activity that's best done in User Mode, because it requires locating and accessing module object files in the filesystem.
The modprobe external program is similar to insmod, since it links in a module specified on the command line. However, modprobe also recursively links in all modules used by the module specified on the command line. For instance, if a user invokes modprobe to link the MS-DOS module, the program links the fat module, if necessary, followed by the MS-DOS module. Actually, modprobe simply checks for module dependencies; the actual linking of each module is done by forking a new process and executing insmod.
How does modprobe know about module dependencies? Another external program named depmod is executed at system startup. It looks at all the modules compiled for the running kernel, which are usually stored inside the /lib/modules directory. Then it writes all module dependencies to a file named modules.dep. The modprobe program can thus simply compare the information stored in the file with the list of linked modules yielded by the /proc/modules file.
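The dependency resolution performed by modprobe can be sketched as a recursive walk over a modules.dep-style table. The two-entry table below is a hypothetical excerpt (real modules.dep entries are file paths, and a module may have several prerequisites):

```c
#include <string.h>

/* Sketch of what modprobe does with modules.dep: resolve a module's
 * prerequisite recursively and emit the modules in load order,
 * dependencies first. */
struct dep_sketch {
    const char *mod;
    const char *needs;   /* NULL if the module has no prerequisite */
};

static const struct dep_sketch deps[] = {
    { "msdos", "fat" },  /* MS-DOS needs the fat module */
    { "fat",   NULL  },
};

/* Append the load order for `mod` into out[]; returns the new count. */
static int load_order(const char *mod, const char *out[], int n)
{
    unsigned long i;
    for (i = 0; i < sizeof(deps) / sizeof(deps[0]); i++) {
        if (strcmp(deps[i].mod, mod) == 0 && deps[i].needs)
            n = load_order(deps[i].needs, out, n);
    }
    out[n++] = mod;
    return n;
}
```

Asking for msdos thus yields the order fat, msdos; in practice each step of that order is carried out by forking a process that executes insmod.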
In some cases, the kernel may invoke the request_module( ) function to attempt
automatic linking for a module.
Consider again the case of a user trying to mount an
MS-DOS filesystem. If the get_fs_type( ) function discovers that the
filesystem is not registered, it invokes the request_module( ) function in the hope that
MS-DOS has been compiled as a module.
If the request_module( )
function succeeds in linking the requested module, get_fs_type( ) can continue as if the module
were always present. Of course, this does not always happen; in our
example, the MS-DOS module might not have been compiled at all. In
this case, get_fs_type( ) returns
an error code.
The request_module( )
function receives the name of the module to be linked as its
parameter. It executes kernel_thread(
) to create a new kernel thread and waits until that kernel
thread terminates.
The kernel thread, in turn, receives the name of the module to
be linked as its parameter and invokes the execve( ) system call to execute the
modprobe external
program,[*] passing the module name to it. In turn, the modprobe program actually links the
requested module, along with any that it depends on.
Bach, M. J. The Design of the Unix Operating System. Prentice Hall International, Inc., 1986. A classic book describing the SVR2 kernel.
Goodheart, B. and J. Cox. The Magic Garden Explained: The Internals of the Unix System V Release 4. Prentice Hall International, Inc., 1994. An excellent book on the SVR4 kernel.
Mauro, J. and R. McDougall. Solaris Internals: Core Kernel Architecture. Prentice Hall, 2000. A good introduction to the Solaris kernel.
McKusick, M. K., M. J. Karels, and K. Bostic. The Design and Implementation of the 4.4 BSD Operating System. Addison Wesley, 1996. Perhaps the most authoritative book on the 4.4 BSD kernel.
Schimmel, Curt. UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers. Addison-Wesley, 1994. An interesting book that deals mostly with the problem of cache implementation in multiprocessor systems.
Vahalia, U. Unix Internals: The New Frontiers. Prentice Hall, Inc., 1996. A valuable book that provides plenty of insight on modern Unix kernel design issues. It includes a rich bibliography.
Beck, M., H. Boehme, M. Dziadzka, U. Kunitz, R. Magnus, D. Verworner, and C. Schroter. Linux Kernel Programming (3rd ed.). Addison Wesley, 2002. A hardware-independent book covering the Linux 2.4 kernel.
Benvenuti, Christian. Understanding Linux Network Internals. O'Reilly Media, Inc., 2006. Covers standard networking protocols and the details of Linux implementation, with a focus on layer 2 and 3 activities.
Corbet, J., A. Rubini, and G. Kroah-Hartman. Linux Device Drivers (3rd ed.). O'Reilly & Associates, Inc., 2005. A valuable book that is somewhat complementary to this one. It gives plenty of information on how to develop drivers for Linux.
Gorman, M. Understanding the Linux Virtual Memory Manager. Prentice Hall PTR, 2004. Focuses on a subset of the chapters included in this book, namely those related to the Virtual Memory Manager.
Herbert, T. F. The Linux TCP/IP Stack: Networking for Embedded Systems (Networking Series). Charles River Media, 2004. Describes in great details the TCP/IP Linux implementation in the 2.6 kernel.
Love, R. Linux Kernel Development (2nd ed.). Novell Press, 2005. A hardware-independent book covering the Linux 2.6 kernel. Some readers suggest reading it before tackling this book.
Mosberger, D., S. Eranian, and B. Perens. IA-64 Linux Kernel: Design and Implementation. Prentice Hall, Inc., 2002. An excellent description of the hardware-dependent Linux kernel for the Itanium IA-64 microprocessor.
Intel. Intel Architecture Software Developer's Manual, vol. 3: System Programming Guide. 2005. Describes the Intel Pentium microprocessor architecture. It can be downloaded from: http://developer.intel.com/design/processors/pentium4/manuals/25366816.pdf.
Intel. MultiProcessor Specification, Version 1.4. 1997. Describes the Intel multiprocessor architecture specifications. It can be downloaded from http://www.intel.com/design/pentium/datashts/24201606.pdf.
Messmer, H. P. The Indispensable PC Hardware Book (4th ed.). Addison Wesley Professional, 2001. A valuable reference that exhaustively describes the many components of a PC.
The official site for getting the kernel source can be found at http://www.kernel.org. Many mirror sites are also available all over the world.
A valuable search engine for the Linux 2.6 source code is available at http://lxr.linux.no.
All distributions of the GNU C compiler should include full documentation for all its features, stored in several info files that can be read with the Emacs program or an info reader. Incidentally, the information on Extended Inline Assembly is quite hard to follow, because it does not refer to any specific architecture. Some pertinent information about 80×86 GCC's Inline Assembly and gas, the GNU assembler invoked by GCC, can be found at:
http://www.delorie.com/djgpp/doc/brennan/brennan_att_inline_djgpp.html
http://www.ibm.com/developerworks/linux/library/l-ia.html
http://www.gnu.org/manual/gas-2.9.1/as.html
The web site (http://www.tldp.org) contains the home page of the Linux Documentation Project, which, in turn, includes several interesting references to guides, FAQs, and HOWTOs.
The newsgroup comp.os.linux.development.system is dedicated to discussions about development of the Linux system.
This fascinating mailing list contains much noise as well as a few pertinent comments about the current development version of Linux, and about the rationale for including or excluding proposed changes in the kernel. It is a living laboratory where new ideas take shape. The name of the mailing list is linux-kernel@vger.kernel.org.
Authored by David A. Rusling, this 200-page book can be viewed at http://www.tldp.org/LDP/tlk/tlk.html, and describes some fundamental aspects of the Linux 2.0 kernel.
The page at http://www.safe-mbox.com/~rgooch/linux/docs/vfs.txt is an introduction to the Linux Virtual File System. The author is Richard Gooch.
We list here a few papers that we have mentioned in this book. Needless to say, many other articles have had a great impact on the development of Linux.
McCreight, E. "Priority Search Trees," SIAM J. Comput., Vol. 14, No. 2, pp. 257–276, May 1985.
Johnson, T. and D. Shasha. "2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm," Proceedings of the 20th IEEE VLDB Conf., Santiago, Chile, 1994, pp. 439–450.
Bonwick, J. "The Slab Allocator: An Object-Caching Kernel Memory Allocator," Proceedings of Summer 1994 USENIX Conference, pp. 87–98.
Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.
Darren Kelly was the production editor for Understanding the Linux Kernel, Third Edition. Sharon Lundsford was the copyeditor and Julie Campbell was the proofreader. Mary Brady and Claire Cloutier provided quality control. Jansen Fernald and Loranah Dimant provided production assistance. Amy Parker provided production services.
Edie Freedman designed the cover of this book, based on a series design by herself and Hanna Dyer. The cover image of a man with a bubble is a 19th-century engraving from the Dover Pictorial Archive. Karen Montgomery produced the cover layout with QuarkXPress 4.1 using Adobe's ITC Garamond font.
David Futato designed the interior layout. The chapter opening image is from the Dover Pictorial Archive. This book was converted to FrameMaker 5.5.6 by Keith Fahlgren with a format conversion tool created by Erik Ray, Jason McIntosh, Neil Walls, and Mike Sierra that uses Perl and XML technologies. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the code font is LucasFont's TheSans Mono Condensed. The illustrations that appear in the book were produced by Robert Romano, Jessamyn Read, and Lesley Borash using Macromedia FreeHand 9 and Adobe Photoshop 6.